

# **Atlas Processor Core** by Stephan Nolting



#### **Proprietary Notice**

ARM is a trademark of Advanced RISC Machines Ltd.
AVR is a trademark of Atmel Corporation.
Xilinx ISE, Spartan and Xilinx ISIM are trademarks of Xilinx, Inc.
Quartus II and Cyclone are trademarks of Altera Corporation.
ModelSim is a trademark of Mentor Graphics, Inc.
Windows is a trademark of Microsoft Corporation.

#### Disclaimer

This project comes with no warranties at all. The reader assumes responsibility for the use of any information from this documentation or from the Atlas Processor Core project itself.

The ATLAS Processor Core was created by Stephan Nolting. For any kind of feedback, feel free to contact me: <a href="mailto:stnolting@gmail.com">stnolting@gmail.com</a>

The most recent version of the ATLAS Processor Core and its documentation can be found at <a href="http://www.opencores.com/project,?????">http://www.opencores.com/project,?????</a>.

# **Table of Contents**

| Section                                     |
|---------------------------------------------|
| 1. Introduction                             |
| 1.1. Core Features                          |
| 1.2. Project Overview                       |
| 1.3. VHDL File Hierarchy                    |
| 2. Core Signal Description                  |
| 2.1. Atlas CPU Interface                    |
| 2.2. Atlas Processor Interface              |
| 3. Programmer's Model                       |
| 3.1. Operating Modes                        |
| 3.2. Exceptions and Interrupts              |
| 3.3. Data Registers                         |
| 3.4. Coprocessors                           |
| 3.5. Machine Status Register                |
| 3.6. Memory Model                           |
| 3.6.1. Virtual Address Extension            |
| 3.7. Program Counter                        |
| 4. Instruction Set                          |
| 4.1. Data Processing                        |
| 4.1.1. User Register Bank Access            |
| 4.1.2. Program Counter Access               |
| 4.1.3. Machine Status Register Access       |
| 4.2. Memory Access                          |
| 4.3. Branch and Link                        |
| 4.4. Load Immediate                         |
| 4.5. Bit Manipulation                       |
| 4.6. Coprocessor Data Processing            |
| 4.7. Coprocessor Data Transfer              |
| 4.8. Multiply-and-Accumulate                |
| 4.9. Undefined Instructions                 |
| 4.10. System Call                           |
| 5. The Atlas Evaluation Assembler           |
| 5.1. Pre-Processor Instructions             |
| 5.2. Programming & Simulating the Processor |
| 5.3. Example Programs                       |
| 5.3.1. Bit Test                             |
| 5.3.2. Comparing Large Operands             |
| 5.3.3. Loop Counters                        |
| 5.3.4. MAC Operation with Flag Update       |
| 5.3.5. Branch Tables                        |
| 6. Core Architecture                        |
| 6.1. Module Description                     |
| 6.2. Data Path                              |
| 6.3. Data Registers                         |
| 6.3. Pipeline                               |
| 6.3.1. Local Pipeline Conflicts             |
| 6.3.2. Temporal Pipeline Conflicts          |
| 6.3.2.1. MSR Write Access                   |
| 6.3.4 Branches                              |
| 6.3.5. Exceptions and Interrupts            |
| 6.4. Interfaces                             |
| 6.4.1. Memory Interface                     |
| 6.4.2. Wishbone Interface                   |
| 6.4.3. Coprocessor Interface                |
| 6.5. System Coprocessor (MMU)               |
| 6.6. Main Control Bus                       |

## 1. Introduction

Welcome to the ATLAS Processor project!

In contrast to the STORM CORE processor, the Atlas processor was completely designed on paper before I wrote the first line of VHDL. Of course, several good ideas evolved during the coding process. Still, the Atlas processor was the first project where I really intended to start from scratch and create a small and powerful processor core rather than doing a big research project (like the STORM CORE was).

I've come a long way, and through my work with well-known processors like ARM (I really love ARM!), DLX and AVR, and of course with the STORM CORE, I gathered a lot of ideas about what a cool processor architecture might look like. So I combined some of the features of these architectures with a lot of coffee to create a CPU that really measures up to all my (and hopefully someone else's) expectations.

I hope you can also feel the beauty of this architecture when working with the CPU (yeah, I'm really proud of this processor, even if this sounds strange in a nerdy way...) ;)

So, have fun with the Atlas processor!

## 1.1. Core Features

- ✓ 16-bit RISC open-source soft-core processor
- ✓ Two implementations available: a small-footprint CPU-only version and a complete main processor with 32-bit addressing
- ✓ Completely described in behavioral VHDL
- ✓ Pipelined instruction execution in 5 stages
- ✓ Single-cycle execution of all instructions (except for branches, multi-cycle operations and TDDs¹)
- ✓ Four forwarding units to accelerate internal operand fetch
- ✓ Powerful memory access and bit manipulation instructions
- ✓ Two different operating modes with unique register sets (8 registers each) and privileges
- ✓ Full hardware support for emulating privileged-mode programs (like operating systems) in unprivileged mode
- ✓ Support for tagged system call operations (software interrupts)
- ✓ Two external interrupt request signals
- ✓ Configurable internal cache\* (shared for instructions and data; fully associative)
- ✓ Configurable directly accessed address area that bypasses the cache (used for shared memories, IO devices, ...)
- ✓ Interface for two external coprocessors to extend the processor's functionality and instruction set
- ✓ Wishbone-compatible pipelined bus interface\* supporting burst transfers
- ✓ Implementation synthesis results (speed optimized) on a Xilinx Spartan-3 XC3S400A FPGA
  - → CPU only: 82 MHz operating frequency at 13% device utilization (479 slices)
  - → Complete processor (CPU, cache², Wishbone bus interface, address extension coprocessor)\*: 70 MHz operating frequency at 29% device utilization (~1050 slices)

\*) Only available when using the complete Atlas processor (not the CPU-only implementation)

<sup>1</sup> TDDs = Temporal data dependencies (processing data, that has not been fetched from the memory yet)

<sup>2</sup> Default configuration: 4 cache pages, 64 byte page size, non-cached area starting at \$FF000000

## 1.2. Project Overview

The Atlas project was created to be a general-purpose processor for applications that require minimal hardware resources while providing a maximum of functionality and processing power. Due to its small footprint, it is well suited for control applications of any kind.

Two different implementation schemes of the CPU are intended. For small applications, the CPU core can be used alone. It only needs to be connected to separate or shared data and instruction memories. In this case, any user hardware modules should be connected via the coprocessor interface to maximize the data transfer bandwidth. This way, a small microcontroller-like system that exactly meets the requirements of the application can be created. This system setup will be called the Atlas *CPU*.

The other implementation scheme, the so-called Atlas *processor*, combines the CPU core with a fully associative shared data and instruction cache, a memory management unit to extend the accessible memory/IO area, and a Wishbone-compatible bus interface. This setup is well suited for using the Atlas processor as the main processor of larger and more complex system-on-chip designs, including multi-core structures or large-scale processing arrays.

## 1.3. VHDL File Hierarchy

All necessary hardware description files are located in the project's *rtl* folder. The top entity of the CPU is *ATLAS\_CORE.vhd*. The top entity of the complete processor, including a cache, a memory management unit and the Wishbone bus interface, is *ATLAS\_PROCESSOR.vhd*.

- ATLAS\_PROCESSOR.vhd → Processor's top entity
  - BUS\_INTERFACE.vhd → Cache and Wishbone bus interface
  - MMU.vhd → Virtual address extension controller
  - ATLAS\_CORE.vhd → CPU's top entity
    - ATLAS\_PKG.vhd → Atlas project package file
    - ALU.vhd → Arithmetical/logical unit, CP interface
    - CTRL.vhd → CPU control system
    - MEM\_ACC.vhd → Data memory access system
    - OP\_DEC.vhd → Opcode decoder
    - REG\_FILE.vhd → Data register file
    - SYS\_REG.vhd → Machine control register (PC and MSR)
    - WB\_UNIT.vhd → Data write-back unit

## 2. Core Signal Description

The following sections give a brief overview of the signal ports of the CPU top entity (*ATLAS\_CORE.vhd*) and of the complete processor top entity (*ATLAS\_PROCESSOR.vhd*), which also includes the cache, the bus interface and the arbitration control logic. All signals and generics are of type **std\_logic\_vector**.

## 2.1. Atlas CPU Interface

The following table presents the interface ports of the Atlas CPU top entity (ATLAS\_CORE.vhd).

| Signal name | Width (#bits) | Direction | Function                                                                                        |
|-------------|---------------|-----------|-------------------------------------------------------------------------------------------------|
| CLK_I       | 1             | IN        | Global clock line, all registers trigger on the rising edge, 50% duty cycle                     |
| RST_I       | 1             | IN        | Global reset signal, synchronized to CLK_I and high-active                                      |
| HOLD_I      | 1             | IN        | Global halt signal, high-active, stops the processor, may only change on the falling edge of CLK_I |
| INSTR_ADR_O | 16            | OUT       | New instruction address (= PC)                                                                  |
| INSTR_DAT_I | 16            | IN        | New instruction input, must be synchronized to CLK_I                                            |
| INSTR_EN_O  | 1             | OUT       | Instruction update enabled when '1', INSTR_DAT_I must not change when this signal is set to '0' |
| SYS_MODE_O  | 1             | OUT       | Current processor operating mode (0: user mode, 1: system mode)                                 |
| SYS_INT_O   | 1             | OUT       | Asserted when an interrupt has been triggered (internal or external)                            |
| MEM_REQ_O   | 1             | OUT       | Set to '1' when a data memory access is requested in the next cycle                             |
| MEM_RW_O    | 1             | OUT       | Memory read ('0') or write ('1') access                                                         |
| MEM_ADR_O   | 16            | OUT       | Memory access address                                                                           |
| MEM_DAT_I   | 16            | IN        | Memory read data, must be synchronized to CLK_I                                                 |
| MEM_DAT_O   | 16            | OUT       | Memory write data                                                                               |
| CP_CP0_EN_O | 1             | OUT       | Coprocessor 0 select                                                                            |
| CP_CP1_EN_O | 1             | OUT       | Coprocessor 1 select                                                                            |
| CP_OP_O     | 1             | OUT       | Coprocessor processing operation ('0') or data transfer ('1')                                   |
| CP_RW_O     | 1             | OUT       | Coprocessor read ('0') or write ('1') data transfer access                                      |
| CP_CMD_O    | 9             | OUT       | Coprocessor command, consisting of source/destination register and operation command            |
| CP_DAT_O    | 16            | OUT       | Coprocessor write data                                                                          |
| CP_DAT_I    | 16            | IN        | Coprocessor read data input for both coprocessors (OR-ed)                                       |
| EXT_INT_0_I | 1             | IN        | External interrupt line 0                                                                       |
| EXT_INT_1_I | 1             | IN        | External interrupt line 1                                                                       |

Table 1: Processor's CPU top entity interface ports

## 2.2. Atlas Processor Interface

The following table presents the signal specifications of the Atlas processor's top entity interface ports (ATLAS\_PROCESSOR.vhd). The provided bus interface is compatible with the Wishbone B4 specification<sup>3</sup>.

| Signal name                    | Width | Direction | Function                                                                    |
|--------------------------------|-------|-----------|-----------------------------------------------------------------------------|
| **Configuration Generics**     |       |           |                                                                             |
| UC_AREA_BEGIN_G                | 32    | -         | First address of not-cached memory/IO area                                  |
| UC_AREA_END_G                  | 32    | -         | Last address of not-cached memory/IO area                                   |
| **Global Control**             |       |           |                                                                             |
| CLK_I                          | 1     | IN        | Global clock line, all registers trigger on the rising edge, 50% duty cycle |
| RST_I                          | 1     | IN        | Global reset signal, synchronized to CLK_I, high-active                     |
| **User Coprocessor Interface** |       |           |                                                                             |
| CP_EN_O                        | 1     | OUT       | Access to external (user) coprocessor                                       |
| CP_OP_O                        | 1     | OUT       | Data transfer / processing operation                                        |
| CP_RW_O                        | 1     | OUT       | Read/write access                                                           |
| CP_CMD_O                       | 9     | OUT       | Processing opcode and register addresses                                    |
| CP_DAT_O                       | 16    | OUT       | Write data output                                                           |
| CP_DAT_I                       | 16    | IN        | Read data input                                                             |
| **Interrupt Line**             |       |           |                                                                             |
| IRQ_I                          | 1     | IN        | External interrupt request line                                             |
| **Wishbone Bus Interface**     |       |           |                                                                             |
| WB_ADR_O                       | 32    | OUT       | Bus access address                                                          |
| WB_CTI_O                       | 3     | OUT       | Cycle type identifier                                                       |
| WB_SEL_O                       | 2     | OUT       | Byte select (always "11" → full 16-bit word data quantity)                  |
| WB_TGC_O                       | 3     | OUT       | Cycle tag                                                                   |
| WB_DATA_O                      | 16    | OUT       | Write data output                                                           |
| WB_DATA_I                      | 16    | IN        | Read data input                                                             |
| WB_WE_O                        | 1     | OUT       | Read/write bus access                                                       |
| WB_CYC_O                       | 1     | OUT       | Valid cycle identifier                                                      |
| WB_STB_O                       | 1     | OUT       | Data strobe                                                                 |
| WB_ACK_I                       | 1     | IN        | Acknowledge input                                                           |
| WB_HALT_I                      | 1     | IN        | Hold bus access                                                             |

Table 2: Processor system's top entity interface ports
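The two configuration generics delimit the address range that bypasses the cache. A minimal Python sketch of that range check follows, assuming both bounds are inclusive ("first" and "last" address). The start value is the default named in the feature list footnote; the end value and the function name are hypothetical and only serve the illustration.

```python
# Default start of the non-cached area according to the feature list footnote.
UC_AREA_BEGIN_G = 0xFF000000
# Hypothetical end address; the manual does not state the default value here.
UC_AREA_END_G = 0xFFFFFFFF


def bypasses_cache(address, begin=UC_AREA_BEGIN_G, end=UC_AREA_END_G):
    """Return True when a 32-bit bus address lies inside the non-cached area."""
    address &= 0xFFFFFFFF  # the Wishbone address bus is 32 bits wide
    return begin <= address <= end
```

With these assumed bounds, an IO register at `0xFF000010` would be accessed directly, while `0x00001000` would go through the cache.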



<sup>3</sup> A copy of the Wishbone specifications can be found in the *core/doc* folder.

## 3. Programmer's Model

The Atlas processor is a true 16-bit RISC architecture, providing different data register banks and privileges for the two operating modes. The accessible registers corresponding to the operating modes are shown in the figure below.



Figure 1: Operation modes and accessible registers

## 3.1. Operating Modes

Two different operation modes are supported by the Atlas processor. The privileged mode is called "system mode", whereas the unprivileged one is called "user mode". After a hardware reset, the core always starts execution in system mode with full privileges. After program setup, the current processor mode can be switched to user mode to start an application that runs with limited privileges to preserve the system's security. The program running in user mode can use system calls to request privileged operations, like direct hardware access. Furthermore, the user program can be interrupted by external interrupts at any time. In this case, the processor automatically switches back to system mode and resumes operation by executing the corresponding interrupt handler. Due to hardware features, the context switches from user mode to system mode and back from system mode to user mode do not need any additional software handling.

<u>Note</u>: All instructions and operations that are allowed in system mode but not in user mode (like user bank transfers, accesses to a protected coprocessor or full MSR accesses) will trigger the software interrupt trap (system call alias). These hardware features make it possible to emulate a system mode program, like an operating system, in user mode. This is very suitable for the implementation of virtual machines that are able to run complete operating systems.

## 3.2. Exceptions and Interrupts

The Atlas CPU features four different interrupt or exception types. In well-known textbooks on computer architecture, "exceptions" refers to all kinds of abnormal program interruptions, no matter what source they emerge from. "Interrupts" are a subgroup of those exceptions whose cause is an external signal, like an interrupt request pin. However, in this documentation and in the hardware description files of the CPU, all kinds of abnormal program interruptions are called interrupts. The different types, their priority during execution, whether they can be masked and the corresponding addresses of the interrupt handlers are listed in the table below.

| Priority    | Interrupt source                                                                    | Mask-able | Handler base address |
|-------------|-------------------------------------------------------------------------------------|-----------|----------------------|
| 1 (highest) | Hardware reset                                                                      | No        | x"0000"              |
| 2           | External interrupt signal 0 (EXT_INT_0_I)                                           | Yes       | x"0001"              |
| 3           | External interrupt signal 1 (EXT_INT_1_I)                                           | Yes       | x"0002"              |
| 4 (lowest)  | Software interrupt (SYSCALL instruction, access violations, undefined instructions) | No        | x"0003"              |

Table 3: Interrupt vector address and priority list

Whenever a valid interrupt condition occurs, the processor stops execution, enters system mode and resumes operation at the corresponding interrupt handler base address. These base addresses are fixed in hardware and only one word separates the different interrupt vectors. Thus, a branch instruction to the final handler (or a branch to an intermediate handler that loads the address of the final handler) must be placed in each interrupt vector slot. Furthermore, the return address is automatically stored to the link register.
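The priority scheme of Table 3 can be illustrated with a short Python sketch. It is only a behavioural model, not the hardware implementation: the source names and the function `select_handler` are hypothetical, and masking of the external lines is assumed to have been applied before a source counts as pending.

```python
# Interrupt sources in descending priority with their handler base addresses (Table 3).
VECTORS = [
    ("reset", 0x0000),
    ("ext_int_0", 0x0001),
    ("ext_int_1", 0x0002),
    ("software", 0x0003),
]


def select_handler(pending):
    """Return (source, handler base address) of the highest-priority pending source.

    `pending` is a collection of source names; None is returned if it is empty.
    """
    for source, address in VECTORS:
        if source in pending:
            return source, address
    return None
```

For example, if external interrupt 1 and a software interrupt are pending at the same time, the model selects the handler slot at x"0002".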

## 3.3. Data Registers

Each operating mode has direct access to a mode-dependent set of eight 16-bit registers. When changing modes (context switch), the registers do not have to be saved on the stack, since the hardware automatically switches to the register bank of the new operation mode. In privileged system mode, all 16 registers can be accessed, but only 8 of them (the actual system mode registers) can be used for data processing or transfer operations. The remaining 8 user mode registers must be accessed via special instructions, and their data has to be moved to a system mode register before performing any data manipulation.
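The banked register scheme can be pictured with a small Python model. This is a sketch under assumptions: the class and method names are hypothetical, and the software interrupt trap that a user bank access from user mode would cause is represented by a Python exception.

```python
class RegisterFile:
    """Behavioural model of two banks of eight 16-bit registers."""

    def __init__(self):
        self.banks = {"user": [0] * 8, "system": [0] * 8}
        self.mode = "system"  # the core starts in system mode after reset

    def read(self, index):
        # Normal operands always come from the bank of the current mode.
        return self.banks[self.mode][index]

    def write(self, index, value):
        self.banks[self.mode][index] = value & 0xFFFF

    def read_user_bank(self, index):
        # Special user bank transfer, only allowed in system mode.
        if self.mode != "system":
            raise PermissionError("user bank access traps in user mode")
        return self.banks["user"][index]
```

Switching `mode` changes which bank `read` and `write` see, without copying any register contents.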

## 3.4. Coprocessors

Up to two external coprocessors can be attached to the Atlas CPU to extend the functionality and the instruction set of the processor core. By default, coprocessor 1 is already included within the processor system and represents a bus access controller that is capable of extending the accessible memory space to 4 GB. The coprocessor 0 slot can be used by the system designer to attach custom logic to the Atlas processor. This coprocessor slot is disabled by default, so any access will be unsuccessful and triggers the software interrupt. The slot can be enabled via special configuration constants in the CPU's VHDL package file.

Both coprocessors can be accessed by special coprocessor instructions. These instructions are separated into two classes: The first class is used for transferring data from a CPU register to a coprocessor and the other way around. The other class only affects the coprocessor and its registers and is meant to perform data processing operations directly on the coprocessor.

Coprocessor 1 is the "system coprocessor" and thus can only be accessed in system mode. Coprocessor slot 0 can also be accessed in user mode, but if necessary, the access can be restricted to system mode by setting the protection flag in the machine status register. Any attempt to access a protected coprocessor in user mode will trigger the software interrupt trap.
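The access rules described above can be condensed into one predicate. The following Python sketch is an illustration only; the function name and its parameters are hypothetical, with `cp_flag` standing for the coprocessor protection flag of the MSR and `cp0_enabled` for the configuration constant in the VHDL package file.

```python
def coprocessor_access_traps(cp, system_mode, cp_flag, cp0_enabled=False):
    """Return True when an access to coprocessor `cp` triggers the software interrupt."""
    if cp == 1:
        # The system coprocessor is only accessible in system mode.
        return not system_mode
    if cp == 0:
        if not cp0_enabled:
            # The user coprocessor slot is disabled by default.
            return True
        # With the protection flag set, user mode accesses are rejected.
        return cp_flag and not system_mode
    raise ValueError("only coprocessors 0 and 1 exist")
```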

## 3.5. Machine Status Register

The machine status register, abbreviated as MSR, holds the global control flags as well as the CPU's ALU flags. The different flags and flag sets of the MSR are shown in the figure below.



Figure 2: Machine Status Register

The flags that are used by the arithmetical/logical unit and the condition computing unit are located in the lowest 10 bits of the machine status register. There are two identical sets of ALU processing flags. Together they are called "ALU flags". One set is used in system mode ("system ALU flags"), the other is used by programs in user mode ("user ALU flags"). Each set holds information about the result of the previous data processing operation. These flags can be updated automatically after a data processing operation by using a specific suffix for the corresponding mnemonic. Otherwise, the flags are not altered. The names, locations and functions of the ALU flags are presented in the table below.

| Flag name | Bit # in user mode | Bit # in system mode | Function             |
|-----------|--------------------|----------------------|----------------------|
| Z         | 0                  | 5                    | Zero flag            |
| C         | 1                  | 6                    | Carry flag           |
| O         | 2                  | 7                    | Overflow flag        |
| N         | 3                  | 8                    | Negative flag (sign) |
| T         | 4                  | 9                    | Transfer flag        |

Table 4: ALU flags for user / system mode

The zero flag (**Z**-flag) is set whenever the operation result is zero. The most significant bit of the operation result (= the sign, when using two's complement representation) is copied to the negative flag (**N**-flag). The carry flag (**C**-flag) indicates a carry for an addition or subtraction, or holds the direct data output of the shifter. The overflow flag (**O**-flag) is set whenever a range overflow takes place during a two's complement arithmetical operation. During a shift operation, an overflow can occur when the sign bit of Ra gets changed. Logical operations do not alter the overflow or the carry flag. The transfer flag (**T**-flag) is not altered by any data processing operation and is used for bit test and transfer operations. Altogether, the ALU flag set of the current processor operation mode determines the condition for conditional branches.
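As an illustration of these flag definitions, the following Python sketch computes Z, N, C and O for a 16-bit addition. It uses the textbook two's complement rules; the function name is hypothetical, and the sketch does not cover subtraction or shifts, since the exact carry convention for those cases is not stated in this section.

```python
def add16_flags(a, b, carry_in=0):
    """Add two 16-bit operands and return (result, flags) for the addition."""
    a &= 0xFFFF
    b &= 0xFFFF
    total = a + b + carry_in
    result = total & 0xFFFF
    flags = {
        "Z": int(result == 0),  # result is zero
        "N": result >> 15,      # most significant bit of the result
        "C": total >> 16,       # carry out of bit 15
        # Overflow: both operands have the same sign, but the result's sign differs.
        "O": int(((a ^ result) & (b ^ result) & 0x8000) != 0),
    }
    return result, flags
```

Adding `0x7FFF` and `0x0001` yields `0x8000` with N and O set, since the positive range of a 16-bit two's complement number is exceeded.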

The system control flags, located in the highest 6 bits of the MSR, are used to configure general CPU functions. The different flags, their location and their functionality are shown in the table below.

| Bit # | Flag name | Function                                    | When set to '0'                                       | When set to '1'                                   |
|-------|-----------|---------------------------------------------|-------------------------------------------------------|---------------------------------------------------|
| 10    | CP        | User coprocessor (coprocessor 0) protection | Coprocessor 0 can be accessed in user and system mode | Coprocessor 0 can only be accessed in system mode |
| 11    | -         | Reserved (leave this bit unchanged)         | -                                                     | -                                                 |
| 12    | GX        | Global external interrupt enable            | Disable all external interrupts                       | Enable external interrupts                        |
| 13    | X0        | External interrupt line 0 mask              | Disable external interrupt line 0                     | Enable external interrupt line 0                  |
| 14    | X1        | External interrupt line 1 mask              | Disable external interrupt line 1                     | Enable external interrupt line 1                  |
| 15    | M         | Operating mode                              | Processor is in user mode                             | Processor is in system mode                       |
Table 5: System control flags

Bit 10 (**CP**-flag) is used to protect the "user" coprocessor (coprocessor 0) from being accessed in user mode. An unauthorized access in user mode will trigger the software interrupt trap.

Bits 12 to 14 (**GX**-, **X0**- and **X1**-flag) configure the two external interrupt lines. An external interrupt request is valid and gets executed when the global external interrupt enable flag (**GX**-flag) and the corresponding interrupt line mask flag (**X0** for EXT\_INT\_0, **X1** for EXT\_INT\_1) are both set to '1'. Whenever a valid external interrupt request occurs, the execution of the corresponding handler is started. The global external interrupt enable flag is then automatically cleared and can be set to '1' again when returning from the interrupt handler routine.

As the last bit of the MSR (bit 15), the **M**-flag determines the current operation mode of the CPU. A '1' indicates system mode and a '0' indicates user mode. This flag is automatically updated on context up-switches (user mode → system mode, exceptions/interrupts) and down-switches (system mode → user mode, e.g. return from an exception/interrupt handler). However, it can also be set or cleared manually when operating in system mode.
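The gating of the external interrupt lines can be sketched in Python as follows. The bit positions are taken from Table 5; the helper names are hypothetical and the code only models the described behaviour (request accepted when GX and the line mask are set, GX cleared and M set on handler entry).

```python
# Bit positions of the system control flags (Table 5).
MSR_CP, MSR_GX, MSR_X0, MSR_X1, MSR_M = 10, 12, 13, 14, 15


def msr_bit(msr, position):
    """Return the value (0 or 1) of a single MSR bit."""
    return (msr >> position) & 1


def external_interrupt_accepted(msr, line, request=True):
    """Check whether a request on external interrupt line 0 or 1 is valid."""
    if line not in (0, 1):
        raise ValueError("only interrupt lines 0 and 1 exist")
    mask_bit = MSR_X0 if line == 0 else MSR_X1
    return bool(request and msr_bit(msr, MSR_GX) and msr_bit(msr, mask_bit))


def msr_on_handler_entry(msr):
    """MSR after an accepted external interrupt: GX cleared, system mode entered."""
    return ((msr & ~(1 << MSR_GX)) | (1 << MSR_M)) & 0xFFFF
```

Because GX is cleared on entry, a second external request is not accepted until the handler (or the return sequence) sets GX again.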

The MSR can be accessed by special instructions that transfer the MSR content to a register or store a register's content to the MSR. A direct initialization of either the user mode or the system mode ALU flags with an immediate is also possible. In system mode, the complete MSR, only the ALU flags, or only the ALU flags of a specific operation mode can be altered. In user mode, only read or write accesses to the user mode ALU flags are allowed. When a user mode program tries to read or alter other bits of the MSR (determined by the actual read/write option), the software interrupt trap is taken.

## 3.6. Memory Model

A uniform and linear address space of $2^{16} = 65536$ bytes is assumed by the Atlas CPU. However, the memory data bus is 16 bits wide, so a 16-bit word is transferred from or to the memory at a time. If a memory system is not capable of presenting a full word at once, the memory manager has to halt the processor until it has assembled a full 16-bit word.

Data memory accesses can be performed on word boundaries (aligned access) or on unaligned addresses by using any register as pointer. When accessing unaligned addresses, the bytes of the transferred data are swapped. This feature is illustrated in the figure below. Note that big-endian mode is used in this example. The actual endianness of the CPU can be modified in the CPU's VHDL package file.



Figure 3: Memory accesses on aligned / unaligned word boundaries (hexadecimal data)
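Since the figure itself is not reproduced here, the following Python sketch shows one possible reading of the byte-swap behaviour. It rests on two assumptions that are not confirmed by the text: an unaligned access addresses the 16-bit word that contains the given byte address, and the two bytes of that word are exchanged. Big-endian byte order is used as in the example; the function name is hypothetical.

```python
def read_word(memory, address):
    """Read a 16-bit word from a byte-addressed, big-endian memory model.

    Assumption: an odd (unaligned) address selects the word containing it
    and returns that word with its two bytes swapped.
    """
    base = address & 0xFFFE  # word-aligned address inside the 64 KB space
    word = (memory[base] << 8) | memory[base + 1]
    if address & 1:
        word = ((word & 0xFF) << 8) | (word >> 8)
    return word
```

Under these assumptions, a memory holding the bytes `12 34` at addresses 0 and 1 returns `0x1234` for address 0 and `0x3412` for address 1.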

For the CPU-only implementation, the data memory can also be used as program memory, implementing a von Neumann architecture with a shared data and instruction memory space. Of course, a Harvard-like architecture with separate memories for instructions and data is also possible. Instruction fetch accesses are always performed on aligned addresses, therefore instruction opcodes must be placed at word boundaries.

#### 3.6.1. Virtual Address Extension

To extend the accessible memory space, the system coprocessor (coprocessor 1, MMU) of the Atlas processor implementation provides the functionality to separate a 32-bit address space (4 GB) into 2<sup>16</sup> blocks of 2<sup>16</sup> bytes each. The block address (= the most significant 16 bits of the address) is generated by base address registers within the MMU, which are separate for instruction and data accesses in user and system mode. It is the task of the system mode program to manage these different memory pages. The chapter about the RTL architecture of the processor will focus on the actual configuration options of the system coprocessor.

It might be quite complicated in some cases to adapt programs to the 64 KB (65536-byte) block size, but this was the easiest way to expand the memory area of the 16-bit processor core... ;)
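The address composition described above amounts to concatenating a 16-bit block address with the 16-bit CPU address. A minimal Python sketch, in which the function name and the layout of the base register table are hypothetical:

```python
def extend_address(base_register, cpu_address):
    """Combine a 16-bit block (base) address with a 16-bit CPU address."""
    return ((base_register & 0xFFFF) << 16) | (cpu_address & 0xFFFF)


# Hypothetical set of MMU base registers, one per mode and access type.
base_registers = {
    ("system", "instruction"): 0x0000,
    ("system", "data"): 0x0001,
    ("user", "instruction"): 0x0010,
    ("user", "data"): 0xFF00,
}

# A user mode data access to CPU address 0x1234 lands in block 0xFF00.
physical = extend_address(base_registers[("user", "data")], 0x1234)
```

Here `physical` is `0xFF001234`; the CPU itself only ever sees the lower 16 bits.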

## 3.7. Program Counter

Both operating modes use the same program counter (PC). It can be accessed via special load/store operations. For calling subroutines, register 7 (R7) of the current register bank is used as link register (LR) to store the return address. Furthermore, the link register is used to store the re-entry point (return address) whenever an interrupt or exception occurs. For exceptions (interrupts caused by software: system calls, undefined instructions or access violations), the return address points to the second instruction after the one that caused the exception. For interrupts (external interrupts via the interrupt lines), the link register points to the second instruction after the one that completed last before the interrupt occurred. In both cases, the link register has to be decremented by two (bytes) to restore the actual return address or re-entry point, respectively.

## 4. Instruction Set

This chapter introduces the encoding and functional explanation of the implemented instruction set. The complete set is divided into several classes and sub-sets, combining several instructions of one type. All instructions are 16-bit wide and must be placed at word-aligned memory addresses.

A short summary of the Atlas instruction set is shown in the figure below.

|                         | 15 | 14 | 13     | 12     | 11     | 10     | 9 8 7          | 6 5 4       | 3         | 2 1 0       |
|-------------------------|----|----|--------|--------|--------|--------|----------------|-------------|-----------|-------------|
| Data Processing         | 0  | 0  | CMD[3] | CMD[2] | CMD[1] | CMD[0] | Rd             | Ra          | S         | Rb          |
| Load MSR to register    | 0  | 0  | 0      | 1      | 1      | 0      | Rd             | A B 0       | 0         | 0 0 0       |
| Store register to MSR   | 0  | 0  | 0      | 1      | 1      | 1      | 0 0 0          | A B 0       | 0         | Rb          |
| Store imm. to ALU flags | 0  | 0  | 0      | 1      | 1      | 1      | 0 T N          | 1 B 1       | 0         | O C Z       |
| Load PC to register     | 0  | 0  | 1      | 1      | 1      | 0      | Rd             | 0 0 0       | 0         | 0 0 0       |
| Store register to PC    | 0  | 0  | 1      | 1      | 0      | 1      | 0 0 0          | Ra          | 0         | L I X       |
| Load reg from user bank | 0  | 0  | 1      | 0      | 0      | 1      | Rd_sys         | Ra_usr      | S         | Ra_usr      |
| Store reg to user bank  | 0  | 0  | 1      | 0      | 0      | 0      | Rd_usr         | Ra_sys      | S         | Ra_sys      |
| Memory Access           | 0  | 1  | P      | U      | W      | L      | Rd             | Ra          | I         | Offset      |
| Memory Swap             | 0  | 1  | 1      | 0      | 0      | 0      | Rd             | Ra          | 0         | Rb          |
| Branch and Link         | 1  | 0  | Cond[3]| Cond[2]| Cond[1]| Cond[0]| L, Offset[8:7] | Offset[6:4] | Offset[3] | Offset[2:0] |
| Load Immediate          | 1  | 1  | 0      | 0      | M      | Imm[7] | Rd             | Imm[6:4]    | Imm[3]    | Imm[2:0]    |
| Bit Manipulation        | 1  | 1  | 0      | 1      | M      | S      | Rd             | Ra          | Bit[3]    | Bit[2:0]    |
| Coprocessor Processing  | 1  | 1  | 1      | 0      | 0      | N      | Cd/Cb          | Ca          | _         | Cmd         |
| Coprocessor Transfer    | 1  | 1  | 1      | 0      | 1      | N      | Cd/Rd          | Ca/Ra       | L         | Cmd         |
| Multiplication          | 1  | 1  | 1      | 1      | 0      | 0      | Rd             | Ra          | 0         | Rb          |
| Multiply-and-Accumulate | 1  | 1  | 1      | 1      | 0      | 0      | Rd             | Ra          | 1         | Rb          |
| Undefined Instruction   | 1  | 1  | 1      | 1      | 0      | 1      |                |             |           |             |
| Undefined Instruction   | 1  | 1  | 1      | 1      | 1      | 0      |                |             |           |             |
| System Call             | 1  | 1  | 1      | 1      | 1      | 1      | Tag            |             |           |             |

Figure 4: Instruction set formats

## 4.1. Data Processing

The instruction encoding of the data processing instructions is shown in the figure below.



Figure 5: Data processing instructions format

This type of instruction performs an arithmetical or logical operation specified by the CMD bit-field on the two operand registers Ra and Rb and places the result in the destination register Rd (the binary operation codes for the CMD-field are specified in the table below). Some instructions only use register A (Ra) and manipulate its content by an immediate coded with the three bits of the Rb bit-field. The instructions can be classified as logical (AND, NAND, ORR, EOR, BIC, TEQ, TST), arithmetical (ADD, ADC, SUB, SBC, INC, DEC, CMP, CPX) or shift (SFT) operations.

Whenever the S-bit is set by appending an "S" to a data processing mnemonic, the carry, negative, zero and overflow flags (= the ALU flags corresponding to the current processor mode) are updated according to the computation result. For test and compare instructions (TST, TEQ, CMP, CPX), the S-bit is always set, so the S suffix is not required for these mnemonics. The assembler will automatically set the S-bit for these instructions. Furthermore, the Rd bit-field is not required for this type of instruction, since no data result is generated. Therefore, the Rd bit-field should be filled with zeros.

The extended compare instruction (CPX) can be used to compare words wider than 16 bits. To this end, the CPX instruction subtracts operand B from operand A, but also takes the carry and zero flags of the previous operation into account when computing the actual carry and zero flag results.

Most instructions combine the two operand registers to produce a result. The INC and DEC operations only use operand register A (Ra) and add or subtract a 3-bit immediate, which is encoded in the Rb bit-field. The shift (SFT) command uses this bit-field (Rb) to specify the type of shift operation, that is applied to Ra.

The assembler internal no-operation pseudo instruction (NOP) is formed from an increment on register 0 with a zero immediate and a cleared S-bit, resulting in no actual system state change. Thus, the binary coding of a NOP instruction is x"0000".

| Mnemonic | CMD  | Action                                                                    |
|----------|------|---------------------------------------------------------------------------|
| INC      | 0000 | Rd = Ra + 3-bit-immediate; immediate is formed from the Rb-bits           |
| DEC      | 0001 | Rd = Ra – 3-bit-immediate; immediate is formed from the Rb-bits           |
| ADD      | 0010 | Rd = Ra + Rb                                                              |
| ADC      | 0011 | Rd = Ra + Rb + Carry-Flag                                                 |
| SUB      | 0100 | Rd = Ra - Rb                                                              |
| SBC      | 0101 | Rd = Ra - Rb - Carry-Flag                                                 |
| CMP      | 0110 | Flags = Ra - Rb; result is not written to a register                      |
| CPX      | 0111 | Flags = Ra - Rb with old flags; result is not written to a register       |
| AND      | 1000 | Rd = Ra AND Rb                                                            |
| ORR      | 1001 | Rd = Ra OR Rb                                                             |
| EOR      | 1010 | Rd = Ra XOR Rb                                                            |
| NAND     | 1011 | Rd = Ra NAND Rb                                                           |
| BIC      | 1100 | Rd = Ra AND NOT Rb (bit clear)                                            |
| TEQ      | 1101 | Flags = Ra XOR Rb; result is not written to a register                    |
| TST      | 1110 | Flags = Ra AND Rb; result is not written to a register                    |
| SFT      | 1111 | Rd = shift(Ra); shift by one position; shift type is specified by Rb-bits |

Table 6: Data processing commands

When using the SFT (shift) instruction, the Rb bit-field encodes the actual shift functionality by an immediate value. Data of Ra is always shifted by one place in the corresponding direction. The eight different shift types are listed in the table below.

| Mnemonic | Rb[2:0] | Function                   | Data result              | Carry result   |
|----------|---------|----------------------------|--------------------------|----------------|
| #SWP     | 000     | Swap bytes                 | Rd = Ra [7:0] & Ra[15:8] | Carry = Ra[15] |
| #ASR     | 001     | Arithmetical right shift   | Rd = Ra[15] & Ra[15:1]   | Carry = Ra[0]  |
| #ROL     | 010     | Rotate left                | Rd = Ra[14:0] & Ra[15]   | Carry = Ra[15] |
| #ROR     | 011     | Rotate right               | Rd = Ra[0] & Ra[15:1]    | Carry = Ra[0]  |
| #LSL     | 100     | Logical left shift         | Rd = Ra[14:0] & '0'      | Carry = Ra[15] |
| #LSR     | 101     | Logical right shift        | Rd = '0' & Ra[15:1]      | Carry = Ra[0]  |
| #RLC     | 110     | Rotate left through carry  | Rd = Ra[14:0] & Carry    | Carry = Ra[15] |
| #RRC     | 111     | Rotate right through carry | Rd = Carry & Ra[15:1]    | Carry = Ra[0]  |

Table 7: Shift commands; note that '&' indicates a concatenation
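The eight shift types of Table 7 can be written down as a small behavioral model. This is only a sketch derived from the table rows; the function name `sft` and its return convention (result, carry) are assumptions made for illustration and are not part of the Atlas tool chain.

```python
from typing import Tuple


def sft(ra: int, kind: str, carry_in: int = 0) -> Tuple[int, int]:
    """Return (Rd, Carry) for a one-position shift of the 16-bit value ra."""
    ra &= 0xFFFF
    msb, lsb = ra >> 15, ra & 1
    left = (ra << 1) & 0xFFFF            # Ra[14:0] & '0'
    right = ra >> 1                      # '0' & Ra[15:1]
    if kind == "SWP":
        return ((ra & 0xFF) << 8) | (ra >> 8), msb
    if kind == "ASR":
        return (msb << 15) | right, lsb
    if kind == "ROL":
        return left | msb, msb
    if kind == "ROR":
        return (lsb << 15) | right, lsb
    if kind == "LSL":
        return left, msb
    if kind == "LSR":
        return right, lsb
    if kind == "RLC":
        return left | (carry_in & 1), msb
    if kind == "RRC":
        return ((carry_in & 1) << 15) | right, lsb
    raise ValueError(f"unknown shift type: {kind}")


rd, carry = sft(0x8001, "ROL")
print(f"{rd:04X} {carry}")  # 0003 1
```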

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. INC, DEC (immediate operations)

```
<INC|DEC>{S} <Rd>, <Ra>, <#Imm>
```

2. CMP, CPX, TST, TEQ (compare/test operations, no result write-back to a register)

```
<CMP|CPX|TST|TEQ>{S} <Ra>, <Rb>
```

3. ADD, ADC, SUB, SBC, AND, ORR, NAND, EOR, BIC (arithmetical / logical operations)

```
<ADD|ADC|SUB|SBC|AND|ORR|NAND|EOR|BIC>{S} <Rd>, <Ra>, <Rb>
```

4. SFT (shift operations)

```
<SFT>{S} <Rd>, <Ra>, <#Shift>
```

```
{S}      Update processing flags corresponding to result when present.
<Rd>     Destination register.
<Ra>     Operand A register.
<Rb>     Operand B register.
<#Imm>   Three bit wide immediate (0...7); with present #-prefix.
<#Shift> Shift type code, corresponding to the table above; with #-prefix.
```

#### **Assembler Examples**

```
INC R0, R1, #2 ; increment R1 by 2 and store result to R0
INCS R0, R1, #2 ; increment R1 by 2, set flags and store result to R0
NOP ; INC R0, R0, #0 = no operation
ADC R2, R5, R2 ; add R5 and R2 with carry and store result to R2
ORRS R3, R3, R4 ; logical or of R3 and R4, set flags
; and store result back to R3
SFT R1, R3, #ROL; rotate left R3 one position and store result to R1
CMP R2, R0 ; compare low words first, then
CPX R3, R4 ; compare high words to evaluate a 32-bit comparison
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format, where the dots in the binary format separate the different bit-fields.

```
INC R0, R1, #2 = 0b 00.0000.000.001.0.010 = x"0012"

INCS R0, R1, #2 = 0b 00.0000.000.001.1.010 = x"001A"

NOP = 0b 00.0000.000.000.0.000 = x"0000"

ORRS R3, R3, R4 = 0b 00.1001.011.011.1.100 = x"25BC"

SFT R1, R3, #ROL = 0b 00.1111.001.011.0.010 = x"3CB2"
```
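The bit-field layout of the data processing format (bits 15:14 = "00", CMD, Rd, Ra, S, Rb) can be checked with a few lines of Python. This is a sketch for illustration only; `encode_dp` is a hypothetical helper, not part of the Atlas assembler. The CMD codes are taken from Table 6.

```python
# CMD codes from Table 6, in ascending order 0000 ... 1111
CMD = ["INC", "DEC", "ADD", "ADC", "SUB", "SBC", "CMP", "CPX",
       "AND", "ORR", "EOR", "NAND", "BIC", "TEQ", "TST", "SFT"]


def encode_dp(mnemonic: str, rd: int, ra: int, rb: int, s: bool = False) -> int:
    """Assemble a data processing opcode; rb may also hold a 3-bit immediate or shift code."""
    for field in (rd, ra, rb):
        if not 0 <= field <= 7:
            raise ValueError("register/immediate fields are 3 bits wide")
    return (CMD.index(mnemonic) << 10) | (rd << 7) | (ra << 4) | (int(s) << 3) | rb


print(f"{encode_dp('INC', 0, 1, 2):04X}")          # INC R0, R1, #2   -> 0012
print(f"{encode_dp('ORR', 3, 3, 4, s=True):04X}")  # ORRS R3, R3, R4  -> 25BC
print(f"{encode_dp('SFT', 1, 3, 0b010):04X}")      # SFT R1, R3, #ROL -> 3CB2
```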

## 4.1.1. User Register Bank Access

The instruction encoding of the user register bank access subset instructions is shown in the figure below.



Figure 6: User register bank access instructions subset formats

Since there are no dedicated instructions to access the user register bank from a program in system mode, the access is encoded using a redundant form of the ORR and AND instructions. For Ra = Rb, these instructions are redundant, because the result is always Ra. Therefore, the opcodes are reused to encode user bank transfers with the special mnemonics LDUB (load from user bank register) and STUB (store to user bank register).

The LDUB instruction uses the ORR binary format with Ra = Rb (= Ra\_usr) to load the user bank register Ra\_usr to system bank register Rd\_sys. STUB, in contrast, uses the binary format of AND with Ra = Rb (= Ra\_sys) to store the system bank register Ra\_sys to the user bank register Rd\_usr.

The transfer is only performed when executed in system mode. In user mode the load/store from/to user bank instructions will trigger the software interrupt.

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. LDUB (load system bank register from user bank register)

```
<LDUB>{S} <Rd_sys>, <Ra_usr>
```

2. STUB (store system bank register to user bank register)

```
<STUB>{S} <Rd_usr>, <Ra_sys>
```

{S} Update processing flags corresponding to result when present.
 <Rd\_sys> System bank destination register.
 <Ra\_sys> System bank source register.

#### **Assembler Examples**

```
LDUB R0, R4  ; load user bank register R4 to system bank register R0
STUB R3, R2  ; store system bank register R2 to user bank register R3
STUBS R2, R6 ; store system bank register R6 to user bank register R2
             ; and set flags corresponding to the data in R6
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format, where the dots in the binary format separate the different bit-fields.

```
LDUB R0, R4 = 0b 00.1001.000.100.0.100 = x"2444"

STUB R3, R2 = 0b 00.1000.011.010.0.010 = x"21A2"

STUBS R2, R6 = 0b 00.1000.010.110.1.110 = x"216E"
```

## 4.1.2. Program Counter Access

The instruction encoding of the program counter access subset instructions is shown in the figure below.



Figure 7: Program counter access instructions subset formats

Since there are no dedicated instructions to access the program counter (PC), the access is encoded using the TEQ and TST instructions with a cleared S-bit. The mnemonics of these special instructions are LDPC (load from PC) and STPC (store to PC). Not all of the bit-fields are used for the transfer operations. Fill the unused bit-fields with zeros.

STPC stores Ra to the program counter. This results in a branch to the address stored in Ra. Therefore, this instruction can be used to implement absolute branches. Since Rb is not used in this case, the bit-field of Rb encodes three additional options (X, I, L) for storing the new PC value. These options are active when the corresponding bit is set. The different options are presented in the table below.

| Bit | Option bit name | Function, when bit is set ('1')                                              |
|-----|-----------------|------------------------------------------------------------------------------|
| 2   | L               | Save return address (PC + 2 bytes) to link register (LR)                     |
| 1   | I               | Set global external interrupt enable flag, only allowed when in system mode! |
| 0   | X               | Change operation mode to 'user mode', only allowed when in system mode!      |

Table 8: PC store options

If bit 0 (X) is set, the processor will resume operation in user mode at the address stored in Ra. This functionality can be used to return from a system mode program (e.g. an interrupt handler) and resume operation in user mode. When bit 1 (I) is set, the global interrupt enable flag will be set. This option is therefore useful to re-enable external interrupts after an external interrupt handler has finished. Both options will only have an effect when executed in system mode. Otherwise these options are ignored or irrelevant, respectively. Bit 2 (L) is set whenever the return address (PC + 2 bytes) is to be saved to the link register. Only the L option is allowed for programs in user mode. The X and I options will trigger the software interrupt trap when executed in user mode.

Note that there are three different mnemonics for the STPC (store register to program counter) instruction. All of them perform the same operation and support the previously mentioned options. The three aliases (STPC, RET, GT) are only used to make the actual intention of an instruction clearer (e.g. RET for a return from a subroutine).

The LDPC instruction will load the current program counter minus 4 bytes (this corresponds to the actual address of the executed LDPC instruction) to register Rd.

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. LDPC (load PC to register)

```
<LDPC> <Rd>
```

2. STPC/RET/GT (Three different mnemonics for the same operation: Store register to PC)

```
<STPC|RET|GT>{X}{I}{L} <Ra>
```

{X} Change to user mode when present (and executed in system mode).
 {I} Set global external interrupt flag when present (and executed in system mode).
 {L} Save return address (PC + 2 bytes) to link register when present.
 <Rd> Destination register.
 <Ra> Source register.

#### **Assembler Examples**

```
LDPC R0   ; load PC to R0
STPC R7   ; store R7 to PC (absolute jump to [R7])
RET R7    ; store R7 to PC (just another mnemonic)
GT R7     ; store R7 to PC (just another mnemonic)
RETX R7   ; store LR to PC and switch to user mode (e.g. return from
          ; software interrupt handler)
RETXI R7  ; store LR to PC, switch to user mode and set global external
          ; interrupt enable flag (e.g. return from ext. int. handler)
GTX R2    ; store R2 to PC and change to user mode
GTI R2    ; store R2 to PC and set global external interrupt flag
GTL R2    ; store R2 to PC and store return address to LR
GTXL R3   ; store R3 to PC, change to user mode and store return address
          ; to LR
GTIL R3   ; store R3 to PC, set global external interrupt flag and store
          ; return address to LR
GTXIL R3  ; store R3 to PC, switch to user mode, set global external
          ; interrupt flag and store return address to LR
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format, where the dots in the binary format separate the different bit-fields.

```
LDPC R0 = 0b 00.1110.000.000.0.000 = x"3800"

RETXI R7 = 0b 00.1101.000.111.0.0.1.1 = x"3473"

GTXIL R3 = 0b 00.1101.000.011.0.1.1.1 = x"3437"
```
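The PC access encodings can be reproduced with a short sketch: LDPC reuses the TST opcode (CMD = 1110) and STPC/RET/GT reuse the TEQ opcode (CMD = 1101), both with S = '0', and the option bits L, I, X occupy bits 2, 1, 0. The helper names `encode_ldpc` and `encode_stpc` are hypothetical and only serve as an illustration.

```python
def encode_ldpc(rd: int) -> int:
    """LDPC <Rd>: TST opcode with cleared S-bit, unused fields are zero."""
    return (0b1110 << 10) | ((rd & 7) << 7)


def encode_stpc(ra: int, x: bool = False, i: bool = False, l: bool = False) -> int:
    """STPC/RET/GT{X}{I}{L} <Ra>: TEQ opcode with cleared S-bit and option bits L, I, X."""
    return (0b1101 << 10) | ((ra & 7) << 4) | (int(l) << 2) | (int(i) << 1) | int(x)


print(f"{encode_ldpc(0):04X}")                             # LDPC R0  -> 3800
print(f"{encode_stpc(7, x=True, i=True):04X}")             # RETXI R7 -> 3473
print(f"{encode_stpc(3, x=True, i=True, l=True):04X}")     # GTXIL R3 -> 3437
```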

## 4.1.3. Machine Status Register Access

The instruction encoding of the machine status register access subset instructions is shown in the figure below.



Figure 8: Machine status register access instructions subset formats

Since there are no dedicated instructions to access the machine status register (MSR), the access is encoded using the CMP and CPX instructions with a cleared S-bit. The mnemonics of these special instructions are LDSR (load register from MSR), STSR (store register to MSR) and STAF (store immediate to MSR's ALU flags). The LDSR instruction uses the CMP binary format with S='0' and will load the current MSR to Rd. STSR and STAF, in contrast, use the binary format of CPX with S='0' to store Rb or an immediate to the MSR. Not all of the bit-fields are used for the transfer operations. Fill the unused bit-fields with zeros.

Corresponding to the option bits (A, B), data can be written to the complete MSR, only to the ALU flags (user and system ALU flags), only to the system ALU flags or only to the user ALU flags. In user mode, only the user mode ALU flags can be copied to a register (all other bits are set to zero) and only a store to the user ALU flags can be executed. All other options will trigger the software interrupt when being executed in user mode. In system mode, all different load and store options are allowed. These different options and their behavior in user/system mode when executing LDSR or STSR instruction are shown in the table below.

| A-bit | B-bit | Mode        | READ access (LDSR)         | STORE access (STSR)         | Software Interrupt |
|-------|-------|-------------|----------------------------|-----------------------------|--------------------|
| 0     | 0     | System mode | Read complete MSR          | Write complete MSR          | No                 |
| 0     | 1     | System mode | Only read all ALU flags    | Only write all ALU flags    | No                 |
| 1     | 0     | System mode | Only read system ALU flags | Only write system ALU flags | No                 |
| 1     | 1     | System mode | Only read user ALU flags   | Only write user ALU flags   | No                 |
| 0     | 0     | User mode   | Unauthorized access!       | Unauthorized access!        | Yes!               |
| 0     | 1     | User mode   | Unauthorized access!       | Unauthorized access!        | Yes!               |
| 1     | 0     | User mode   | Unauthorized access!       | Unauthorized access!        | Yes!               |
| 1     | 1     | User mode   | Only read user ALU flags   | Only write user ALU flags   | No                 |

*Table 9: MSR store options and mode corresponding behavior* 
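The access rules of Table 9 reduce to a simple check: in user mode, only the option A = '1', B = '1' (user ALU flags) is permitted, and every other combination raises the software interrupt. A minimal sketch of this rule, with the hypothetical helper name `msr_access`, could look as follows.

```python
from typing import Tuple

# Access target selected by the option bits (A, B), as listed in Table 9
TARGET = {
    (0, 0): "complete MSR",
    (0, 1): "all ALU flags",
    (1, 0): "system ALU flags",
    (1, 1): "user ALU flags",
}


def msr_access(a: int, b: int, system_mode: bool) -> Tuple[str, bool]:
    """Return (accessed part of the MSR, software interrupt triggered?)."""
    traps = (not system_mode) and (a, b) != (1, 1)
    return TARGET[(a, b)], traps


print(msr_access(1, 1, system_mode=False))  # ('user ALU flags', False)
print(msr_access(0, 0, system_mode=False))  # ('complete MSR', True)
```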

The STAF instruction is used to directly copy an immediate encoded within the instruction either to the system mode ALU flags or to the user mode ALU flags only. The T, N, O, C, Z bit-fields correspond to the new value the user/system mode ALU flags will be set to. Note that option bit A must be set to '1' for STAF operations. Option bit B encodes whether the immediate flag data is written to the system mode ALU flags (B = '0') or to the user mode ALU flags (B = '1').

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. LDSR (load register from machine status register)

```
<LDSR> <Rd>, {usr_flags|sys_flags|alu_flags}
```

2. STSR (store register to machine status register)

```
<STSR> <Rb>, {usr_flags|sys_flags|alu_flags}
```

3. STAF (store immediate to system / user ALU flags)

```
<STAF> <#Imm>, <usr_flags|sys_flags>
```

```
<Rd>                            Destination register.
<Rb>                            Source register.
<#Imm>                          Five bit immediate, corresponding to u/s ALU flags.
{usr_flags|sys_flags|alu_flags} Write user ALU flags, system ALU flags, all ALU flags or full MSR, when no argument is present.
<usr_flags|sys_flags>           Write user ALU flags or system ALU flags.
```

#### **Assembler Examples**

```
LDSR R1, usr_flags ; load MSR ALU flags to R1
STSR R3 ; store R3 to MSR (full access)
STSR R4, usr_flags ; only write R4 to the user mode ALU flags
STAF #1, usr_flags ; set carry flag of the user mode ALU flags
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format, where the dots in the binary format separate the different bit-fields.

```
LDSR R1, usr_flags = 0b 00.0110.001.110.0.000 = x"18E0"

STSR R3 = 0b 00.0111.000.011.0.000 = x"1830"

STSR R4, usr_flags = 0b 00.0111.110.100.0.000 = x"1E40"

STSR R4, alu_flags = 0b 00.0111.010.100.0.000 = x"1A40"

STAF #1, usr_flags = 0b 00.0111.000.111.0.001 = x"1A71"
```

## 4.2. Memory Access

The instruction encoding of the memory access instructions is shown in the figure below.



Figure 9: Memory access instructions formats

The memory access instructions move data between a data register and an addressed memory location. Ra always specifies the register that points to the accessed memory address. The L-bit determines the data transfer direction. When L is set to '1', the content of Rd is stored to the memory location addressed by Ra (STR). If the L-bit is set to '0', data from the addressed memory location is loaded into the register specified by the Rd bit-field (LDR).

Several different indexing options are implemented. An offset can be added to or subtracted from the memory base address in Ra (U = '0' subtract, U = '1' add) before or after the actual memory access. Setting the P-bit to '0' will add/subtract the offset before the memory access. When the P-bit is set, the offset will be added to or subtracted from the base register after the memory access. The result of the operation base +/- offset can be written back to the base register Ra when the W-bit is set. The actual offset can either be a register (I = '0') or an unsigned 3-bit immediate (I = '1').

| Option bit | Function when set to '0'                                                        | Function when set to '1'                                                                     |
|------------|---------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| P          | Pre-indexing (add/subtract offset to/from base <b>before</b> the memory access) | Post-indexing (add/subtract offset to/from base <b>after</b> the memory access)              |
| U          | Subtract offset from base register                                              | Add offset to base register                                                                  |
| W          | Discard result of base+/- offset after memory access                            | Write back the result of base +/- offset to the base register after the actual memory access |
| L          | Load data from memory into a register                                           | Store data from a register to memory                                                         |
| I          | Offset is a register specified in the offset bit-field                          | Offset is an unsigned 3-bit <b>immediate</b> specified in the offset bit-field               |

Table 10: Memory access options
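The interplay of the P, U and W bits can be sketched as a small address calculation. This is a behavioral illustration only; the function name `index_access` and its return convention are assumptions, and a 16-bit wrap-around of the address arithmetic is assumed.

```python
from typing import Tuple


def index_access(base: int, offset: int, p: int, u: int, w: int) -> Tuple[int, int]:
    """Return (address used for the memory access, new value of the base register Ra)."""
    indexed = (base + offset if u else base - offset) & 0xFFFF  # assumed 16-bit wrap-around
    address = base if p else indexed     # P = '1': post-indexing accesses the unmodified base
    new_base = indexed if w else base    # W = '1': write base +/- offset back to Ra
    return address, new_base


# LDR R1, R2, +R3, pre, !   with R2 = 0x0100 and R3 = 0x0004
print(index_access(0x0100, 0x0004, p=0, u=1, w=1))  # (260, 260)
# LDR R1, R2, -R3, post, !  with R2 = 0x0100 and R3 = 0x0004
print(index_access(0x0100, 0x0004, p=1, u=0, w=1))  # (256, 252)
```

Note that the combination P = '1' and W = '0' is not handled specially in this sketch, since that option code is reused for a different instruction.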

One kind of indexing option does not seem logical: A post indexing without a base write back (P = '1' and W = '0'). Here, the post indexing operation is redundant. Therefore, this type of option code is used to specify a new memory access instruction: The atomic memory data swap (SWP). This instruction copies the data of the memory location, which is specified by Ra, to Rd and moves afterwards the data of Rb (defined by the Offset bit-field) to the assigned memory location (Rb => M[Ra] => Rd). Hence, a load instruction is followed by a store instruction. Both instructions are tied together (atomic), so no interrupt can be executed before the swap instruction has finished. This is very useful for implementing system semaphores.

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. LDR, STR (load/store from/to memory)

```
<LDR|STR> <Rd>, <Ra>, <+|-><Rb|#Imm>, <pre|post>, {!}
```

2. SWP (swap registers with memory)

```
<SWP> <Rd>, <Ra>, <Rb>
```

#### **Assembler Examples**

```
LDR R1, R2, +R3, pre ; R1 <= M[R2+R3]

LDR R1, R2, +R3, pre, ! ; R1 <= M[R2+R3] and set R2=R2+R3 afterwards

LDR R1, R2, -R3, post, ! ; R1 <= M[R2] and set R2=R2-R3 afterwards

LDR R1, R2, +#2, post, ! ; R1 <= M[R2] and set R2=R2+2 afterwards

STR R4, R5, -R6, pre ; R4 => M[R5-R6]

STR R4, R5, -#2, pre, ! ; R4 => M[R5-2] and set R5=R5-2 afterwards

SWP R2, R3, R4 ; M[R3] => R2; R4 => M[R3]
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format, where the dots in the binary format separate the different bit-fields.

## 4.3. Branch and Link

The instruction encoding of the branch and link instructions is shown in the figure below.



Figure 10: Branch and link instructions format

The branch instruction (B) is used to perform a relative jump to a different location within a range of -256 to +255 words (remember, 1 word = 2 bytes). The offset is stored as two's complement in the offset bit-field. When using the BL instruction (with L = '1'), a linked branch is executed, i.e. the return address (PC + 2 bytes) is stored to the link register LR (= R7). The jump can be made conditional by appending a condition suffix to the B/BL instruction. The different condition suffixes and codes as well as their computation scheme (based on the current state of the ALU flags) are listed in the table below.

| ASM Suffix | Cond code | Condition               | Condition computation (flags) |
|------------|-----------|-------------------------|-------------------------------|
| EQ         | 0000      | Equal                   | Z                             |
| NE         | 0001      | Not equal               | not Z                         |
| CS         | 0010      | Unsigned higher or same | C                             |
| CC         | 0011      | Unsigned lower          | not C                         |
| MI         | 0100      | Negative                | N                             |
| PL         | 0101      | Positive or zero        | not N                         |
| OS         | 0110      | Overflow                | O                             |
| OC         | 0111      | No overflow             | not O                         |
| HI         | 1000      | Unsigned higher         | C and (not Z)                 |
| LS         | 1001      | Unsigned lower or same  | (not C) or Z                  |
| GE         | 1010      | Greater than or equal   | N xnor O                      |
| LT         | 1011      | Less than               | N xor O                       |
| GT         | 1100      | Greater than            | (not Z) and (N xnor O)        |
| LE         | 1101      | Less than or equal      | Z or (N xor O)                |
| TS         | 1110      | Transfer flag set       | T                             |
| AL         | 1111      | Always                  | 1                             |

Table 11: Condition codes

A branch (and link) is only executed if the specified condition is true or when there is no conditional suffix.
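The condition computation of Table 11 can be expressed directly as boolean expressions over the ALU flags. The following sketch uses the hypothetical function name `condition_true`; the flags are passed as booleans and "xnor" is written as equality.

```python
def condition_true(cond: str, z: bool, c: bool, n: bool, o: bool, t: bool) -> bool:
    """Evaluate a condition suffix from Table 11 against the current ALU flags."""
    table = {
        "EQ": z,                      "NE": not z,
        "CS": c,                      "CC": not c,
        "MI": n,                      "PL": not n,
        "OS": o,                      "OC": not o,
        "HI": c and not z,            "LS": (not c) or z,
        "GE": n == o,                 "LT": n != o,
        "GT": (not z) and (n == o),   "LE": z or (n != o),
        "TS": t,                      "AL": True,
    }
    return bool(table[cond])


# N = O = 1: the condition "greater than or equal" holds
print(condition_true("GE", z=False, c=False, n=True, o=True, t=False))  # True
```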

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

B (branch, conditional or unconditional)

```
<B>{L}{cond} <label>

{L}     Store return address to link register when present.
{cond}  Condition code from the table above. If not present, 'always' (AL) condition is used.
<label> Branch label, relative offset in two's complement (max -256/+255 words).
```

#### **Assembler Examples**

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format, where the dots in the binary format separate the different bit-fields.

```
B label_2 = 0b 10.1111.0.000000001 = x"BC01"

label_2:
    BL subr_1 = 0b 10.1111.1.000000001 = x"BE01"

subr_1:
    BLEQ subr_1 = 0b 10.0000.1.111111111 = x"83FF"
```
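The branch format (bits 15:14 = "10", 4-bit condition code, L-bit, 9-bit two's complement word offset) can be assembled with the following sketch. The helper `encode_branch` is hypothetical; it takes the raw word offset as an argument and does not model how the assembler derives that offset from a label.

```python
# Condition suffixes in the order of their codes 0000 ... 1111 (Table 11)
COND = ["EQ", "NE", "CS", "CC", "MI", "PL", "OS", "OC",
        "HI", "LS", "GE", "LT", "GT", "LE", "TS", "AL"]


def encode_branch(offset_words: int, cond: str = "AL", link: bool = False) -> int:
    """Assemble B/BL{cond} with a signed word offset in the range -256 ... +255."""
    if not -256 <= offset_words <= 255:
        raise ValueError("offset out of range (-256 ... +255 words)")
    return (0b10 << 14) | (COND.index(cond) << 10) | (int(link) << 9) | (offset_words & 0x1FF)


print(f"{encode_branch(1, link=True):04X}")         # BL with offset +1   -> BE01
print(f"{encode_branch(-1, 'EQ', link=True):04X}")  # BLEQ with offset -1 -> 83FF
```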

## 4.4. Load Immediate

The instruction encoding of the load immediate instructions is shown in the figure below.



Figure 11: Load immediate instructions format

The load immediate instructions are used to load an 8-bit constant encoded within the instruction to the high byte or sign extended to all bits of the register Rd, respectively. The immediate constant itself is constructed from bit 10 concatenated with bits 6 downto 0 of the instruction word. The LDIL (M = '0') mnemonic will load the immediate to the low byte of Rd. All bits of the high byte of Rd will be loaded with the most significant bit of the immediate. This results in a complete load of Rd with the sign (bit 7 of the immediate  $\rightarrow$  bit 10 of the instruction opcode) extended immediate. The LDIH (M = '1') mnemonic will load the immediate to the high byte of Rd, leaving the low byte of Rd unchanged. When loading a true 16-bit immediate to register, make sure to load the low byte of it first, otherwise the high byte will be discarded.

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

```
LDIL / LDIH (load immediate 8-bit constant to lower/upper byte)
```

```
<LDI><L|H> <Rd>, <#Imm>

<L|H>
Load only high byte of destination register (H) or load whole register with sign extended immediate (L).

<Rd>Destination register.

<#Imm> 8-bit "unsigned" immediate value; with present #-prefix.
```

#### **Assembler Examples**

```
(linear execution of all following instructions is assumed)

LDIL R4, #255 ; load sign extended 255 (= -1) to R4 (R4 = x"FFFF")

LDIL R4, #2 ; load sign extended 2 to R4 (R4 = x"0002")

LDIH R4, #7 ; load 7 to the high byte of R4 (R4 = x"0702")
```
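The effect of the three assembler examples on the register content can be traced with a small behavioural model. This is only a sketch in Python; the helper names `ldil` and `ldih` are chosen for illustration and simply mirror the textual description of the two mnemonics.

```python
def ldil(rd: int, imm: int) -> int:
    """LDIL: immediate to the low byte, bit 7 copied into all high-byte bits."""
    imm &= 0xFF
    return (0xFF00 | imm) if imm & 0x80 else imm


def ldih(rd: int, imm: int) -> int:
    """LDIH: immediate to the high byte, low byte of rd is kept."""
    return ((imm & 0xFF) << 8) | (rd & 0x00FF)


r4 = 0
r4 = ldil(r4, 255)  # R4 = x"FFFF"
r4 = ldil(r4, 2)    # R4 = x"0002"
r4 = ldih(r4, 7)    # R4 = x"0702"
print(f"{r4:04X}")
```

Swapping the order (LDIH first, then LDIL) loses the high byte, which is why the low byte has to be loaded first.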

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format; the dots in the binary format separate the individual bit-fields.

```
LDIL R4, #255 = 0b 11.00.0.1.100.1111111 = x"C67F"

LDIL R4, #2 = 0b 11.00.0.0.100.0000010 = x"C202"

LDIH R4, #7 = 0b 11.00.1.0.100.0000111 = x"CA07"
```
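As a cross-check of the split immediate field, the encodings above can be generated programmatically. The Python function below is an illustrative sketch (the name `encode_ldi` is an assumption, not an Atlas tool); it places bit 7 of the immediate at instruction bit 10 and bits 6 downto 0 at the lowest seven instruction bits.

```python
def encode_ldi(high: bool, rd: int, imm: int) -> int:
    """Pack LDIL (high=False) or LDIH (high=True): 11.00.M.imm[7].Rd(3).imm[6:0]."""
    if not 0 <= rd <= 7:
        raise ValueError("Rd must be R0..R7")
    if not 0 <= imm <= 255:
        raise ValueError("immediate must be an 8-bit unsigned value")
    return (
        (0b1100 << 12)
        | (int(high) << 11)   # M bit: 0 = LDIL, 1 = LDIH
        | ((imm >> 7) << 10)  # immediate bit 7
        | (rd << 7)
        | (imm & 0x7F)        # immediate bits 6..0
    )


for high, rd, imm in [(False, 4, 255), (False, 4, 2), (True, 4, 7)]:
    print(f"{encode_ldi(high, rd, imm):04X}")
```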

## 4.5. Bit Manipulation

The instruction encoding of the bit manipulation instructions is shown in the figure below.



Figure 12: Bit manipulation instructions format

The bit manipulation instructions are used to manipulate a single bit of a register and to store the result to the same or another register; the previous state of the bit is irrelevant. The bit itself is addressed by a 4-bit immediate in the Bit-field.

The SBR instruction will set the assigned bit to '1', whereas the CBR instruction clears the bit. A store of the assigned bit to the T-flag is possible by using the STB instruction. For this case, the Rd bit-field is irrelevant and must be set to "000". The LDB instruction loads the current state of the T-flag to the assigned bit. The different option codes (M and S bits) of the four bit manipulation instructions are shown in the table below.

| M | S | Function                                                                                          |
|---|---|---------------------------------------------------------------------------------------------------|
| 0 | 0 | Take data from register Ra, <u>clear</u> the assigned bit and store the result to Rd              |
| 0 | 1 | Take data from register Ra, set the assigned bit and store the result to Rd                       |
| 1 | 0 | Take data from register Ra, <b>copy</b> the T-flag to the assigned bit and store the result to Rd |
| 1 | 1 | Take the assigned bit from register Ra and store it to the T-flag; no data write back to Rd       |

Table 12: Bit manipulation operations

The Atlas CPU only features a T-flag-based branch that is taken whenever the T-flag is set (BTS / BLTS). For many applications, however, it is necessary to branch when a bit that was loaded to the T-flag is cleared. Therefore, a more efficient way than using two branches has been implemented: the bit loaded from a register into the T-flag can be inverted during the transfer. A subsequent BTS branch is then taken when the original bit of the register is zero. To invert a bit while it is being transferred to the T-flag, use the "store bit to T-flag and invert" instruction STBI. Note: The original source bit of the register is not affected by this instruction.
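The four option codes and the inverting STBI variant can be summarised in a short behavioural model. The following Python sketch is an illustration only; the function `bit_op` and its return convention (new Rd value or `None`, new T-flag) are assumptions made for this example.

```python
def bit_op(m: int, s: int, ra: int, bit: int, t: int, invert: bool = False):
    """Return (rd, t): rd is None when no register write back takes place."""
    mask = 1 << bit
    if (m, s) == (0, 0):   # CBR: clear the assigned bit
        return ra & ~mask & 0xFFFF, t
    if (m, s) == (0, 1):   # SBR: set the assigned bit
        return ra | mask, t
    if (m, s) == (1, 0):   # LDB: copy the T-flag to the assigned bit
        return (ra | mask) if t else (ra & ~mask & 0xFFFF), t
    # STB / STBI: copy the (optionally inverted) bit to the T-flag
    return None, ((ra >> bit) & 1) ^ int(invert)


# STBI followed by BTS branches when the tested bit is zero
_, t_flag = bit_op(1, 1, ra=0x0000, bit=1, t=0, invert=True)
print("branch taken" if t_flag else "branch not taken")  # branch taken
```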

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. SBR, CBR (set/clear bit from register Ra and write result to Rd)

```
<SBR|CBR> <Rd>, <Ra>, <#Imm>
```

2. LDB (load bit from T-flag and write result to Rd)

```
<LDB> <Rd>, <Ra>, <#Imm>
```

3. STB/STBI (store bit to T-flag / store inverted bit to T-flag)

```
<STB>{I} <Ra>, <#Imm>

{I}     Invert source bit while it is transferred to the T-flag when present.

<Rd>    Destination register.

<Ra>    Source register.

<#Imm>  4-bit immediate value assigning the desired bit; with present #-prefix.
```

#### **Assembler Examples**

```
SBR R3, R4, #4 ; set bit 4 of R4's data and store result to R3
CBR R0, R0, #12 ; clear bit 12 of register R0
STB R7, #1 ; store bit 1 of R7 to the T-flag
STBI R7, #1 ; store inverted bit 1 of R7 to the T-flag
LDB R7, R0, #5 ; copy T-flag to bit 5 of R0's data and store result to R7
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format; the dots in the binary format separate the individual bit-fields.

```
SBR R3, R4, #4 = 0b 11.01.0.1.011.100.0100 = x"D5C4"

CBR R0, R0, #12 = 0b 11.01.0.0.000.000.1100 = x"D00C"

STB R7, #1 = 0b 11.01.1.1.000.111.0001 = x"DC71"

STBI R7, #1 = 0b 11.01.1.1.001.111.0001 = x"DCF1"

LDB R7, R0, #5 = 0b 11.01.1.0.111.000.0101 = x"DB85"
```
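The five encodings above follow one common pattern, which the Python sketch below reproduces. The function and table names are illustrative assumptions. One detail is inferred purely from the coding examples: STB carries "000" in the Rd bit-field, while STBI carries "001" there.

```python
# mnemonic -> (M bit, S bit, invert marker placed in the Rd field for STB/STBI)
BIT_OPS = {
    "CBR": (0, 0, 0),
    "SBR": (0, 1, 0),
    "LDB": (1, 0, 0),
    "STB": (1, 1, 0),
    "STBI": (1, 1, 1),
}


def encode_bit_op(mnemonic: str, ra: int, bit: int, rd: int = 0) -> int:
    """Pack a bit manipulation word: 11.01.M.S.Rd(3).Ra(3).Bit(4)."""
    m, s, inv = BIT_OPS[mnemonic]
    rd_field = inv if (m and s) else rd  # STB/STBI have no destination register
    return (0b1101 << 12) | (m << 11) | (s << 10) | (rd_field << 7) | (ra << 4) | bit


print(f"{encode_bit_op('SBR', ra=4, bit=4, rd=3):04X}")  # D5C4
print(f"{encode_bit_op('STBI', ra=7, bit=1):04X}")       # DCF1
```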


## 4.6. Coprocessor Data Processing

The instruction encoding of the coprocessor data processing instructions is shown in the figure below.



Figure 13: Coprocessor data processing instructions format

The coprocessor data processing instruction CDP instructs one of the two external coprocessors to perform a coprocessor-internal operation. The actual functionality of this instruction depends on the implemented coprocessor. However, it is designed to specify two coprocessor registers: one is used as source only, the other one as source and destination of the operation. The function itself is selected via the three-bit CMD immediate bit-field. The register addresses as well as the command opcode are presented directly at the coprocessor port. See the coprocessor chapter in the architecture section of this data sheet for more information.

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

CDP (coprocessor data processing)

```
<CDP> <#CP>, <Ca>, <Cb>, <#Cmd>

<#CP>   Coprocessor ID ("#0" or "#1")

<Ca>    Coprocessor operand A / destination register.

<Cb>    Coprocessor operand B register.

<#Cmd>  3-bit immediate value presenting a coprocessor command.
```

#### **Assembler Examples**

```
CDP #0, C0, C0, #4 ; instruct CP 0 to execute command 4 on registers c0 and c0 and place result in register c0
CDP #1, C7, C3, #1 ; instruct CP 1 to execute command 1 on registers c7 and c3 and place result in register c7
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format; the dots in the binary format separate the individual bit-fields.

```
CDP #0, C0, C0, #4 = 0b 11.10.0.0.000.000.0.100 = x"E004"

CDP #1, C7, C3, #1 = 0b 11.10.0.1.111.011.0.001 = x"E7B1"
```
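The two CDP encodings can be reproduced with the following Python sketch. The function name `encode_cdp` is an illustrative assumption, and the field positions are read off the binary patterns above (bit 11 and bit 3 are zero for CDP).

```python
def encode_cdp(cp: int, ca: int, cb: int, cmd: int) -> int:
    """Pack a CDP word: 11.10.0.CP.Ca(3).Cb(3).0.Cmd(3)."""
    if cp not in (0, 1):
        raise ValueError("coprocessor ID must be 0 or 1")
    if not (0 <= ca <= 7 and 0 <= cb <= 7 and 0 <= cmd <= 7):
        raise ValueError("register numbers and command must fit into 3 bits")
    return (0b1110 << 12) | (cp << 10) | (ca << 7) | (cb << 4) | cmd


print(f"{encode_cdp(0, 0, 0, 4):04X}")  # E004
print(f"{encode_cdp(1, 7, 3, 1):04X}")  # E7B1
```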

## 4.7. Coprocessor Data Transfer

The instruction encoding of the coprocessor data transfer instructions is shown in the figure below.



Figure 14: Coprocessor data transfer instructions format

To exchange data between a coprocessor register and an Atlas CPU register, the MRC (load data from coprocessor) and MCR (store data to coprocessor) instructions are used. Parallel to the data transfer, a command can be specified to trigger additional coprocessor operations. The L-bit determines the transfer direction (load: L = 0, store: L = 1).

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

1. MRC (load data from coprocessor)

```
<MRC> <#CP>, <Rd>, <Ca>, <#Cmd>
```

2. MCR (store data to coprocessor)

```
<MCR> <#CP>, <Cd>, <Ra>, <#Cmd>
```

#### **Assembler Examples**

```
MRC #0, R3, C4, #1 ; CP0: R3 <= C4 and execute CMD 1
MCR #1, C7, R3, #0 ; CP1: C7 <= R3 and execute CMD 0
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format; the dots in the binary format separate the individual bit-fields.

```
MRC #0, R3, C4, #1 = 0b 11.10.1.0.011.100.0.001 = x"E9C1"

MCR #1, C7, R3, #0 = 0b 11.10.1.1.111.011.1.000 = x"EFB8"
```
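The transfer encodings differ from CDP only in bit 11 and in the L-bit at position 3. The Python sketch below is illustrative (the name `encode_cp_transfer` and its parameters are assumptions); the field layout is taken from the two coding examples above.

```python
def encode_cp_transfer(store: bool, cp: int, first: int, second: int, cmd: int) -> int:
    """Pack MRC (store=False) or MCR (store=True): 11.10.1.CP.reg(3).reg(3).L.Cmd(3).

    'first' and 'second' are the register numbers in assembler operand order,
    i.e. (Rd, Ca) for MRC and (Cd, Ra) for MCR.
    """
    if cp not in (0, 1):
        raise ValueError("coprocessor ID must be 0 or 1")
    return (
        (0b1110 << 12)
        | (1 << 11)           # data transfer instead of data processing
        | (cp << 10)
        | (first << 7)
        | (second << 4)
        | (int(store) << 3)   # L-bit: 0 = load (MRC), 1 = store (MCR)
        | cmd
    )


print(f"{encode_cp_transfer(False, 0, 3, 4, 1):04X}")  # MRC #0, R3, C4, #1 -> E9C1
print(f"{encode_cp_transfer(True, 1, 7, 3, 0):04X}")   # MCR #1, C7, R3, #0 -> EFB8
```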

## 4.8. Multiply-and-Accumulate

The instruction encodings of the multiply and multiply-and-accumulate instructions are shown in the figure below.



Figure 15: MUL / MAC instruction formats

These instructions provide extended arithmetical functions for multiplication operations. The MUL instruction multiplies Ra and Rb and places the lowest 16 result bits in Rd (Rd <= (Ra\*Rb)(15:0)). The MAC instruction also multiplies Ra and Rb, but additionally adds the content of Rd to the lowest 16 bits of the multiplication result. The result is then placed in Rd (Rd <= (Ra\*Rb)(15:0)+Rd). Neither instruction modifies any flags. Since a multiplication with an optional addition requires a lot of area, the synthesis of the MUL and MAC instructions can be enabled or disabled using architecture constants in the Atlas Processor package VHDL file. By default, only the MUL instruction is synthesized. When trying to execute an instruction that has not been synthesized (by default the multiply-and-accumulate instruction), the software interrupt trap is taken.

#### **Assembler Syntax**

Items in { } are optional, whereas items in < > are required. Note the spaces and commas introduced by the lexical rules.

```
MUL, MAC (multiply / multiply-and-accumulate)

<MUL | MAC> <Rd>, <Ra>, <Rb>
```

#### **Assembler Examples**

```
MUL R0, R1, R2 ; R0 = R1 * R2
MAC R0, R1, R2 ; R0 = R1 * R2 + R0
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format; the dots in the binary format separate the individual bit-fields.

```
MUL R0, R1, R2 = 0b 11.11.00.000.001.0.010 = x"F012"
MAC R0, R1, R2 = 0b 11.11.00.000.001.1.010 = x"F01A"
```
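The arithmetic of both instructions can be modelled in a few lines. The Python sketch below is a behavioural illustration only; it assumes that the final sum of MAC wraps around to 16 bits, since Rd is a 16-bit register. Because only the lowest 16 product bits are kept, the result is the same for signed and unsigned operands.

```python
MASK16 = 0xFFFF


def mul(ra: int, rb: int) -> int:
    """MUL: Rd <= (Ra * Rb)(15:0)."""
    return (ra * rb) & MASK16


def mac(rd: int, ra: int, rb: int) -> int:
    """MAC: Rd <= (Ra * Rb)(15:0) + Rd, truncated to 16 bits (assumption)."""
    return (mul(ra, rb) + rd) & MASK16


print(mul(3, 5))            # 15
print(mac(10, 3, 5))        # 25
print(mul(0x0100, 0x0100))  # 0, the upper product bits are discarded
```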


## 4.9. Undefined Instructions

The instruction encoding of the undefined instructions is shown in the figure below.



Figure 16: Undefined instruction formats

These instruction types are not implemented yet; they reserve space for future instruction set extensions. Therefore, these instructions **should not be used**. However, when executed, the undefined instructions will behave like a system call with a tag corresponding to the lowest 10 bits of the instruction.

## 4.10. System Call

The instruction encoding of the system call instruction is shown in the figure below.



Figure 17: System call instruction format

The system call (SYSCALL) instruction is used to enter system mode from a running user program (software interrupt). When executed, program execution stops, the re-entry point (return address) plus 2 bytes offset is stored in the system link register, the mode is changed to system mode and program execution resumes at the software interrupt address. The lowest 10 bits of the instruction can be used to directly transfer an argument (tag) to the software interrupt handler. The handler can extract this tag after loading the instruction that caused the system call.

When executing the SYSCALL instruction in system mode, the instruction will behave like a branch and link instruction to the software interrupt vector. Note: When returning with RTX from the software interrupt handler, the original program will be resumed in user mode rather than in system mode.

Note: The software interrupt will also be executed for example whenever a user-mode program attempts an unauthorized access to a coprocessor or to restricted bits of the MSR.

#### **Assembler Syntax**

Items in { } are optional, whereas items in <> are required. Note the spaces and commas introduced by the lexical rules.

#### **Assembler Examples**

```
SYSCALL #1002 ; trigger software interrupt with '1002' as tag
SYSCALL ; trigger software interrupt with no tag
```

#### **Coding Examples**

The assembled instructions are shown in binary (0b ...) and hexadecimal (x"...") format; the dots in the binary format separate the individual bit-fields.

```
SYSCALL #1002 = 0b 11.11.11.1111101010 = x"FFEA"

SYSCALL = 0b 11.11.11.0000000000 = x"FC00"
```
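Packing the tag and recovering it again inside the handler is a simple mask operation. The Python sketch below only illustrates the bit arithmetic; both function names are assumptions, and how a real handler fetches the causing instruction word is not shown.

```python
def encode_syscall(tag: int = 0) -> int:
    """Pack a SYSCALL word: 11.11.11.tag(10)."""
    if not 0 <= tag <= 0x3FF:
        raise ValueError("tag must fit into 10 bits")
    return (0b111111 << 10) | tag


def syscall_tag(instruction: int) -> int:
    """Extract the 10-bit tag from a loaded SYSCALL instruction word."""
    return instruction & 0x3FF


word = encode_syscall(1002)
print(f"{word:04X}")      # FFEA
print(syscall_tag(word))  # 1002
```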

# 5. The Atlas Evaluation Assembler

I've programmed a small assembler that is capable of assembling the previously explained instructions into an Atlas CPU-compatible VHDL program memory initialization file. The assembler is still very rudimentary, but it can already be used to write and assemble complex programs. The program is located in the *core/asm* folder and can be run from the Windows command prompt. The assembly source file is passed as the first argument when calling the assembler. A simple example program, which introduces the lexical rules and the basic layout of an assembler program, can also be found in that folder.

To assemble this example file, type and execute this in your Windows command prompt:

```
...\core\asm>atlas_asm test.asm
```

In the same folder, the program will generate an "init.vhd" file, which contains the data initialization area of a VHDL memory declaration (the program memory). The "out.bin" file contains the assembled program in binary format and is intended for future use by a bootloader.

## 5.1. Pre-Processor Instructions

The pre-processor instructions can make assembler life much easier, since they provide different features to create more abstract programs. See the *test.asm* file in the *core/asm* folder for an example assembler program including all the different pre-processor instructions.

Only a few rudimentary instructions are supported so far, but hopefully the pre-processor capabilities will grow in the future ;)

| Instruction | Example | Function |
|-------------|---------|----------|
| .equ        | .equ temp r4<br>.equ sys_reg c1<br>.equ de_val #1<br>.equ mem_size #256 | Defines an alias for a CPU register (r0, ..., r7), a coprocessor register (c0, ..., c7) or an immediate value (positive 16-bit integer in decimal representation, introduced with '#'-prefix) |
| .space      | .space #4<br>.space mem_size | Creates an area of the given size that is initialized with zeroes (x"0000" = NOPs) |
| .dw         | .dw #23432 | Directly initializes the corresponding memory position with a positive 16-bit immediate (decimal value, introduced with '#'-prefix), with a previously defined .equ-definition or with a branch label address ("[label]") |

Table 13: Pre-processor instructions

## 5.2. Programming & Simulating the Processor

To easily evaluate and simulate a program for the Atlas processor, the "processor\_tb.vhd" testbench in the core/sim/testbench\_processor\_system folder can be used. This testbench includes the top entity of the Atlas processor together with a Wishbone-compatible demo memory component ("TEST\_MEM.vhd", in the same folder as the testbench), which is directly connected to the Wishbone interface of the processor (without any interconnection fabric). This chapter takes this setup as the starting point for the simulation tutorial.

When you are using Xilinx ISIM for simulation, you can use the predefined waveform configuration file "PROCESSOR\_WAVE.wcfg" in the core/sim/isim\_wave folder. This waveform already contains all relevant signals of the complete processor.

Of course, it is also possible to simulate programs on the CPU core only. In this case, you have to create a compatible memory component (which is very simple) that contains the assembled program. The chapter about the rtl hardware of the core presents the specifications of the data and/or instruction memory and its interface.

To implement the assembled program into the simulation environment, the content of the "init.vhd" file, generated by the Atlas assembler, has to be copied to the memory initialization area of the memory VHDL component file.

The following code presents a cutout of the "TEST\_MEM.vhd" component.

Open the "init.vhd" file from the asm folder, copy all of the content and paste it between the two brackets of the MEM\_FILE signal initialization (right there, where the red note is).

Afterwards, the cutout should look something like the following example.

## 5.3. Example Programs

This chapter presents some example program fragments that illustrate how to use the Atlas assembler mnemonics to create your own application programs. Note that all code fragments need to be embedded into a 'real' program to run properly.

### 5.3.1. Bit Test

This is an example of how to use the T-flag to implement bit test operations.

Bit test operations are also very often used to leave a linear program execution. Since the BTS (branch if T-flag is set) instruction only branches when the T-flag is set, the following implementation of a branch that is taken whenever a bit is zero seems obvious.

But we can do better than that! The bit that is stored to the T-flag can be inverted during the transfer. Thus, a true zero-testing branch can be implemented using the BTS instruction as well.

### 5.3.2. Comparing Large Operands

The CPX instruction compares two registers while also taking the zero and carry flags of a previous comparison into account. This makes it well suited for implementing a comparison of two arbitrarily wide operands.
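The chaining of flags across several 16-bit comparisons can be illustrated with a behavioural model. The Python sketch below is an assumption-laden illustration, not the documented CPX definition: it treats the carry flag as a borrow and combines the zero flag with a logical AND, which is the usual way such an extended compare works. All function names are hypothetical.

```python
def cmp16(a: int, b: int):
    """Plain compare of two 16-bit words: returns (zero, carry), carry = borrow."""
    diff = a - b
    return (diff & 0xFFFF) == 0, diff < 0


def cpx16(a: int, b: int, zero_in: bool, carry_in: bool):
    """Extended compare: includes the flags of the previous comparison."""
    diff = a - b - int(carry_in)
    return zero_in and (diff & 0xFFFF) == 0, diff < 0


def compare_wide(a_words, b_words):
    """Compare operands given as lists of 16-bit words, least significant word first."""
    zero, carry = cmp16(a_words[0], b_words[0])
    for a, b in zip(a_words[1:], b_words[1:]):
        zero, carry = cpx16(a, b, zero, carry)
    return zero, carry  # zero: operands equal, carry: first operand is smaller


# 0x0001_0000 compared with 0x0000_FFFF: not equal, first operand is larger
print(compare_wide([0x0000, 0x0001], [0xFFFF, 0x0000]))  # (False, False)
```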

### 5.3.3. Loop Counters

Conditional loops are one of the basic elements of a program. The following example shows how to implement loops with a small overhead.

### 5.3.4. MAC Operation with Flag Update

Neither the MAC nor the MUL instruction features a status flag update. Also, a synthesized MAC instruction requires a lot of additional hardware resources. So, if a MAC instruction with flag update is required, it is suitable to only allow the synthesis of the MUL instruction and construct the actual MAC operation with additional instructions.

```
;constructed MAC operation with flag update
;executed in system or user mode

; compute R0=R1*R2+R3 and set flags corresponding to the result

MUL R0, R1, R2 ; R0 = R1 * R2
ADDS R0, R0, R3 ; R0 = R0 + R3 and set status flags
```

### 5.3.5. Branch Tables

Branch or call tables are a good method to jump to different locations without the need to compare a register with immediate values. For example, this kind of value-defined branching can be used to trigger different operations using the system call instruction with a tag, where the tag represents the number of the subroutine that shall be called. Note that only 16-bit addresses are used in the following example. Thus, the subroutines must be in the same page as the branch-table code.

```
;branch/call table (subroutine addresses are 16-bit, so in the same page)
; executed in system or user mode
; R4 presents the number of the subroutine to be called,
; thus a 2 in R4 would call subroutine 2

; first we have to load the 16-bit base address of the branch table
LDIL R0, #low[branch_table]  ; load low byte of label address
LDIH R0, #high[branch_table] ; load high byte of label address

; multiply index by two by left-shifting one position; this is necessary because
; each subroutine address in the table is 16-bit wide and the Atlas CPU uses
; byte addressing mode by default
SFT R4, R4, #LSL
ADD R0, R0, R4               ; add offset to table base address
LDR R1, R0, +#0, PRE         ; load address to R1 and perform no further indexing
GTL R1                       ; goto and link - branch to the loaded address in R1 and
                             ; save return address to the link register

branch_table:                ; beginning of branch table
.DW [subroutine_0]           ; absolute 16-bit address of label "subroutine_0"
.DW [subroutine_1]           ; absolute 16-bit address of label "subroutine_1"
.DW [subroutine_2]           ; absolute 16-bit address of label "subroutine_2"
.DW [subroutine_3]           ; absolute 16-bit address of label "subroutine_3"
...
```

# 6. Core Architecture

This chapter takes a closer look at the actual rtl implementation of the CPU core.

## 6.1. Module Description

The following table presents the different Atlas VHDL rtl files and their functionality.

| File name           | Functionality                                                                                                                                                                                          |
|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ATLAS_PROCESSOR.vhd | This is the top entity of the complete processor system, including all files mentioned below.                                                                                                          |
| BUS_INTERFACE.vhd   | The bus interface incorporates a shared I/D-cache as well as a Wishbone compatible bus interface to communicate with other modules of the SoC.                                                         |
| MMU.vhd             | The memory management unit, implemented as system coprocessor, extends the accessible memory space, separated for the different operating modes.                                                        |
| ATLAS_pkg.vhd       | CPU package file. All architecture constants and system configuration options can be found here. Also, the endianness, synthesized hardware modules and present coprocessors are declared in this file. |
| ATLAS_CORE.vhd      | Top entity of the CPU providing all external interface signals to directly communicate with RAM/ROM; all CPU sub modules are instantiated in this file.                                                |
| OP_DEC.vhd          | Opcode decoder. The instruction opcodes are decoded into processor control signals in this file.                                                                                                       |
| CTRL.vhd            | This file provides the control "spine" of the processor. Intermediate control signal computations and the control signal buffers for each pipeline stage are located here.                             |
| SYS_REG.vhd         | The system register file contains the program counter, the machine status register and the interrupt and context control circuits.                                                                     |
| REG_FILE.vhd        | This file contains the main data register file, organized as 16*16-bit memory.                                                                                                                         |
| ALU.vhd             | The ALU holds the primary arithmetical/logical unit, the coprocessor interface as well as the multiplication unit.                                                                                     |
| MEM_ACC.vhd         | All data memory requests emerge from this unit. Furthermore, processing result routing circuits are located here.                                                                                      |
| WB_UNIT.vhd         | The write-back unit takes data from the coprocessors, the ALU or the data memory interface and writes it back to the register file.                                                                    |

Table 14: Atlas VHDL rtl files description

## 6.2. Data Path

More to come... ^^

## 6.3. Data Registers

For efficient hardware implementation, the 16 data registers are mapped to a 16x16-bit memory block. The most significant bit of the register address (bit 3) indicates the accessed bank ('0' = user bank, '1' = system bank). The actual register – memory cell mapping is presented in the table below.

| Address | Register | Address | Register | Address | Register  | Address | Register  |
|---------|----------|---------|----------|---------|-----------|---------|-----------|
| 0000    | User R0  | 0100    | User R4  | 1000    | System R0 | 1100    | System R4 |
| 0001    | User R1  | 0101    | User R5  | 1001    | System R1 | 1101    | System R5 |
| 0010    | User R2  | 0110    | User R6  | 1010    | System R2 | 1110    | System R6 |
| 0011    | User R3  | 0111    | User R7  | 1011    | System R3 | 1111    | System R7 |

Figure 18: Register mapping to memory block

Note: On some FPGAs, the register file might be implemented using LUT registers instead of dedicated memory blocks, since not all FPGAs provide dedicated memory that can be read asynchronously.
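The mapping between a register number, the selected bank and the memory cell is a plain bit concatenation. The following Python sketch illustrates it; the function name `reg_file_address` is an assumption made for this example.

```python
def reg_file_address(reg: int, system_bank: bool) -> int:
    """Return the 4-bit memory cell address: bank bit (bit 3) followed by the register number."""
    if not 0 <= reg <= 7:
        raise ValueError("register number must be 0..7")
    return (int(system_bank) << 3) | reg


print(f"{reg_file_address(4, system_bank=True):04b}")   # 1100 = System R4
print(f"{reg_file_address(7, system_bank=False):04b}")  # 0111 = User R7
```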

## 6.3. Pipeline

A classical 5-stage pipeline is implemented in the Atlas CPU. To clarify the term "pipeline stage": a stage always starts with the update of the registers that drive this stage. Likewise, a cycle starts with the update of a register on a rising edge of the system clock. The table below shows the pipeline stages of the CPU.

| Stage #   | Name                                 | Functionality |
|-----------|--------------------------------------|---------------|
| 1: **IF** | Instruction fetch                    | At the beginning of this stage, the program counter (PC) is updated with the next instruction address. For linear programs, the new PC value is old_value plus 2 bytes. This address is then applied to the instruction memory. |
| 2: **OF** | Instruction decode and operand fetch | The instruction memory accepts the address and outputs the corresponding instruction on the rising edge of the system clock. The opcode decoder decodes the opcode, loads the operands from the register file and also constructs immediate values. |
| 3: **EX** | Execution                            | In the execution stage, the main data processing takes place. Furthermore, data is presented to the external coprocessors, the PC and the MSR, depending on the current instruction. |
| 4: **MA** | Memory access                        | The memory access stage provides write data and the corresponding address to the data memory. Also, data returned by the coprocessors is read in this cycle. |
| 5: **WB** | Write back                           | The write back stage accepts read data from the memory or any kind of read data from the previous stage (coprocessor, MSR, ALU processing result) and applies it to the register file whenever a data write back is valid. With the next rising edge, this data is stored to the destination register and thus the execution cycle is completed. |

Table 15: Atlas CPU pipeline stages

### 6.3.1. Local Pipeline Conflicts

Whenever data is needed that has already been processed but has not yet reached the end of the pipeline, a local data dependency occurs. For data that will be processed by the ALU, the producing and the consuming instruction can be separated by 1, 2 or 3 cycles in the pipeline. The following example program illustrates these types of local conflicts (the NOPs are only used to generate the corresponding distances).

```
;1 cycle distance:
inc r4, r1, #1   ; r4 = r1 + 1
cmp r4, r1       ; compare r4 and r1

;2 cycles distance:
dec r5, r1, #1   ; r5 = r1 - 1
nop
tst r5, r5       ; set flags to r5 AND r5

;3 cycles distance:
sft r6, r1, #swp ; swap bytes of r1 and store to r6
nop
nop
add r6, r6, r6   ; r6 = r6 * 2
```

Two different forwarding units are used to prevent pipeline stalls whenever these kinds of local data dependencies occur. The first one is located in the OF-stage and can forward data from the WB-stage (data separation by 3 cycles) into the two operand slots of the ALU. The second one is located in the EX-stage and can forward data from the MA-stage (data separation by 1 cycle) and from the WB-stage (data separation by 2 cycles) into the two operand slots of the ALU.

| Cycle | IF  | OF  | EX  | MA  | WB  | Note                                                      |
|-------|-----|-----|-----|-----|-----|-----------------------------------------------------------|
| n+0   | DEC | CMP | INC |     |     |                                                           |
| n+1   | NOP | DEC | CMP | INC |     | 1 cycle distance: INC (MA) is forwarded to CMP (EX)       |
| n+2   | TST | NOP | DEC | CMP | INC |                                                           |
| n+3   | SFT | TST | NOP | DEC | CMP |                                                           |
| n+4   | NOP | SFT | TST | NOP | DEC | 2 cycles distance: DEC (WB) is forwarded to TST (EX)      |
| n+5   | NOP | NOP | SFT | TST | NOP |                                                           |
| n+6   | ADD | NOP | NOP | SFT | TST |                                                           |
| n+7   |     | ADD | NOP | NOP | SFT | 3 cycles distance: SFT (WB) is forwarded to ADD (OF)      |

Figure 19: Processing data forwarding
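Which forwarding path serves which dependency follows directly from the distance between producer and consumer. The Python sketch below replays the example program as a list of (mnemonic, destination, sources) tuples and reports the forwarding path for each dependency. It is a simplified illustration: the tuple representation and the function name `find_forwards` are assumptions, and only register dependencies within a distance of three instructions are considered.

```python
# distance between producer and consumer -> forwarding path (source stage -> target stage)
FORWARDING = {1: "MA -> EX", 2: "WB -> EX", 3: "WB -> OF"}


def find_forwards(program):
    """Return (consumer, register, distance, path) for every forwarded operand."""
    hits = []
    for i, (name, _dest, sources) in enumerate(program):
        for src in dict.fromkeys(sources):  # unique sources, order preserved
            for distance in (1, 2, 3):
                j = i - distance
                if j >= 0 and program[j][1] == src:
                    hits.append((name, src, distance, FORWARDING[distance]))
                    break  # the nearest producer holds the newest value
    return hits


program = [
    ("inc", "r4", ("r1",)), ("cmp", None, ("r4", "r1")),
    ("dec", "r5", ("r1",)), ("nop", None, ()), ("tst", None, ("r5", "r5")),
    ("sft", "r6", ("r1",)), ("nop", None, ()), ("nop", None, ()),
    ("add", "r6", ("r6", "r6")),
]
for hit in find_forwards(program):
    print(hit)
```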

Furthermore, the CPU features two small additional forwarding units to accelerate memory data transfers. The first one is also located in the EX-stage and can forward data from the WB-stage into the ALU bypass operand slot. The second one is located in the MA-stage and can forward data from the WB-stage into the write data port of the data memory.

### 6.3.2. Temporal Pipeline Conflicts

Temporal data dependencies occur whenever the operand fetch stage tries to forward data for ALU processing that has not yet been fetched from the data memory. The following example illustrates this kind of data conflict.

```
;memory read-data dependency
ldr r1, r0, +#2, pre  ; r1 = MEM[r0+2], no address pointer update
inc r1, r1, #1  ; r1 = r1 + 1
```

This type of dependency cannot be solved by forwarding alone. The CPU has to insert an empty "dummy cycle" (a NOP) to stop the data processing instruction in the OF-stage until the source data from the memory is available.



Figure 20: Memory read-data temporal data dependency

While the INC instruction is still in the OF-stage, the memory load instruction (LDR) has reached the MA-stage and the fetched data can be forwarded to the OF-stage.
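The stall rule can be expressed as a small scheduling function. The Python sketch below is a simplified model under the assumption that only an instruction directly following the load needs the dummy cycle (one instruction further apart, the data can already be forwarded). The tuple format and the function name `schedule` are hypothetical.

```python
def schedule(program):
    """Return the issue order with 'dummy' slots inserted after dependent loads.

    Each program entry is (mnemonic, destination register, source registers).
    """
    issued = []
    for i, (mnemonic, _dest, sources) in enumerate(program):
        if i > 0:
            prev_mnemonic, prev_dest, _ = program[i - 1]
            # memory read data is not yet available for the directly following instruction
            if prev_mnemonic == "ldr" and prev_dest in sources:
                issued.append("dummy")
        issued.append(mnemonic)
    return issued


print(schedule([("ldr", "r1", ("r0",)), ("inc", "r1", ("r1",))]))
# ['ldr', 'dummy', 'inc']
```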

#### 6.3.2.1. MSR Write Access

Whenever the machine status register (MSR) is updated via the STSR (or an alias instruction like RTX) instruction, a dummy cycle has to be inserted afterwards. Imagine a system mode program, that clears the M-flag by writing new data to the MSR to switch to user mode.

The operand fetch has to wait until this update is completed, because the M-flag determines the most significant bit of the register addresses and thus the actual register bank, where data is taken from. Since the M-flag is cleared now, the new data for the INC instruction has to be fetched from the user register bank and not from the system register bank. Therefore a dummy instruction slot is necessary.

| <u>Cycle</u> | IF   | OF   | EX    | MA    | WB   |                      |
|--------------|------|------|-------|-------|------|----------------------|
| n+0          | CBR  | LDRS |       |       |      |                      |
| n+1          | STSR | CBR  | LDRS  |       |      |                      |
| n+2          | INC  | STSR | CBR   | LDRS  |      |                      |
| n+3          |      | INC  | STSR  | CBR   | LDRS | Conflict detected!   |
| n+4          |      | INC  | dummy | STSR  | CBR  | Dummy cycle inserted |
| n+5          |      |      | INC   | dummy | STSR |                      |

Figure 21: MSR update, status dependency

Even though only the mode (M) and transfer (T) flags are vulnerable to this kind of conflict, any manual MSR update causes the system to insert a dummy cycle; this simplification considerably reduces the hardware overhead. Since MSR-updating instructions are very rare in most programs, this penalty should rarely be relevant.
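To illustrate why the operand fetch depends on the M-flag, the following Python sketch models the register address generation. It assumes eight registers per bank, with the M-flag forming address bit 3 (consistent with the 3+1 address bits listed for the main control bus) and M = '1' selecting the system bank. The function name is hypothetical.

```python
USER_MODE = 0    # M-flag cleared
SYSTEM_MODE = 1  # M-flag set


def physical_reg_addr(m_flag: int, reg_index: int) -> int:
    """Combine the M-flag (most significant bit) with a 3-bit register index.

    Assumption: each mode has its own bank of eight registers, so the
    complete register file is addressed with 4 bits.
    """
    if m_flag not in (USER_MODE, SYSTEM_MODE):
        raise ValueError("m_flag must be 0 or 1")
    if not 0 <= reg_index <= 7:
        raise ValueError("reg_index must be in 0..7")
    return (m_flag << 3) | reg_index


# The same register name maps to different physical registers per mode:
r1_system = physical_reg_addr(SYSTEM_MODE, 1)  # system bank
r1_user = physical_reg_addr(USER_MODE, 1)      # user bank
```

If an operand were fetched before the MSR write has completed, the stale M-flag would select the wrong bank, which is what the dummy cycle prevents.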

#### 6.3.4. Branches

Branches are necessary to leave the linear processing of a program. They occur whenever an unconditional branch instruction or a conditional branch instruction with a fulfilled condition is executed. A manual PC write access via the STPC instruction (or any alias instruction like RET) also results in a branch to the new address. The Atlas CPU does not use any kind of branch prediction; therefore, the strategy is "branches are always taken".

When the PC is loaded with a new address, the instructions that were already loaded into the pipeline after the branch-causing instruction have to be invalidated ("pipeline flush").

| <u>Cycle</u> | IF  | OF  | EX  | MA             | WB             |                   |
|--------------|-----|-----|-----|----------------|----------------|-------------------|
| n+0          | ADD | B   |     |                |                |                   |
| n+1          | SUB | ADD | B   |                |                | Branch detected!  |
| n+2          | INC | SUB | ADD | B              |                | Flushing pipeline |
| n+3          |     | INC | SUB | <del>ADD</del> | B              |                   |
| n+4          |     |     | INC | SUB            | <del>ADD</del> |                   |
| n+5          |     |     |     | INC            | SUB            |                   |

Figure 22: Flushing the pipeline after a taken branch

Since it takes two cycles to fetch a new instruction into the opcode-decoding OF-stage after a nonlinear PC update, the two instructions following the branch are no longer valid and have to be discarded.
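The two-slot flush can be expressed as a short Python sketch. This is a simplified illustration rather than the actual control logic: it assumes only that the two instructions fetched directly behind a taken branch are invalidated. `mark_flushed` and the `TARGET` placeholder are hypothetical names.

```python
from typing import List, Tuple

FLUSH_SLOTS = 2  # instructions fetched behind a taken branch (see text above)


def mark_flushed(stream: List[str], branch_pos: int) -> List[Tuple[str, bool]]:
    """Pair every fetched instruction with a 'valid' flag.

    stream     -- instructions in fetch order
    branch_pos -- index of the taken branch within the stream
    """
    result = []
    for pos, name in enumerate(stream):
        flushed = branch_pos < pos <= branch_pos + FLUSH_SLOTS
        result.append((name, not flushed))
    return result


# 'TARGET' stands for the first instruction fetched from the branch target.
fetched = ["B", "ADD", "SUB", "TARGET"]
marked = mark_flushed(fetched, branch_pos=0)
```

Every taken branch therefore costs two instruction slots in this model, independent of the branch condition or target.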

#### 6.3.5. Exceptions and Interrupts

Exceptions and interrupts behave in most ways like branches. Whenever a specific event occurs, for instance the execution of the software interrupt instruction (SYSCALL), a branch to the corresponding address (the address of the software interrupt vector in this case) takes place. The system automatically performs a context change to provide a system state that does not affect the interrupted program. While exceptions (processor-internal interrupts) can only occur synchronously to the pipeline / instruction flow, external interrupts can occur at any time. Thus, the interrupt-related mode changes and branches need to be synchronized to the pipeline. External interrupts can therefore only be processed when the current instruction in the EX-stage can be interrupted and resumed without any problems. Hence, the instruction must not be a multi-cycle operation, a branch or an instruction with a temporal data dependency.

#### 6.4. Interfaces

To operate, the CPU needs an interface to a CPU-external data/instruction memory and possibly also to external coprocessors. All interfaces are fully synchronous to the CPU's main clock. The different interfaces are explained in this chapter.



Figure 23: The three main interfaces of the Atlas CPU

# 6.4.1. Memory Interface

This chapter will focus on the "stand alone" implementation of the Atlas CPU. When in stand-alone mode, the CPU only requires a shared/distributed program/data memory together with user hardware connected as coprocessors (exemplary implementation style).

The CPU can be configured either to use separate memories or caches for data and instructions (Harvard architecture) or to use a shared memory/cache for data and instructions (von Neumann architecture). As an example, this chapter takes a closer look at a Harvard-like implementation.



Figure 24: Interface to separated data and instruction memories; signal names correspond to the CPU's instruction/data interface ports

Let's start with the instruction fetch interface. This interface is very simple to implement. It basically consists of the instruction address (INSTR\_ADR\_O), the instruction word read-back (INSTR\_DAT\_I) and an enable signal (INSTR\_EN\_O). The instruction address outputs the current value of the CPU's program counter. On every rising edge of the core clock, the instruction memory outputs the instruction word corresponding to the applied instruction address on the instruction word read-back line. Whenever the instruction enable line (INSTR\_EN\_O) goes low, the instruction memory is disabled and has to hold the last instruction word, since this buffer is used as the instruction register.

The data interface operates in nearly the same manner. Here, the enable signal (MEM\_REQ\_O) is applied one cycle before the actual memory access (data and address) takes place. Therefore, it has to be buffered in a flip-flop (on the rising edge of the CPU clock) to create the necessary delay. This behavior can be used to tell a memory management system in advance that the core requests access to the memory. Thus, the delayed enable signal triggers the operation of the memory. Just like the instruction memory, the data memory has to keep its last data output if the enable signal goes low again. Depending on the read/write select signal (MEM\_RW\_O), data is stored to the memory (r/w = '1') or read from the memory (r/w = '0'). The access address is presented on the address output port (MEM\_ADR\_O). The store data comes from the memory write data port (MEM\_DAT\_O). Read-back data from the memory is applied to the read data port (MEM\_DAT\_I) on the rising edge of the CPU clock.
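The timing of the data interface can be checked with a cycle-based Python model of a memory attached to it. This is a sketch under assumptions, not the reference memory of the project: the class name `DataMemoryModel` is hypothetical, the memory is a sparse dictionary indexed directly by MEM\_ADR\_O, and every call to `clock` represents one rising clock edge.

```python
class DataMemoryModel:
    """Cycle-based sketch of a data memory on the Atlas data interface."""

    def __init__(self):
        self.mem = {}    # sparse storage, unwritten locations read as 0
        self.en_ff = 0   # MEM_REQ_O delayed by one flip-flop
        self.dat_i = 0   # value driven onto MEM_DAT_I

    def clock(self, mem_req_o, mem_rw_o=0, mem_adr_o=0, mem_dat_o=0):
        """Apply one rising clock edge and return the MEM_DAT_I value."""
        if self.en_ff:                        # delayed enable triggers access
            if mem_rw_o:                      # r/w = '1': store data
                self.mem[mem_adr_o] = mem_dat_o & 0xFFFF
            else:                             # r/w = '0': read data
                self.dat_i = self.mem.get(mem_adr_o, 0)
        # When not enabled, dat_i keeps its last value (output is held).
        self.en_ff = 1 if mem_req_o else 0    # buffer the request
        return self.dat_i


ram = DataMemoryModel()
ram.clock(1)                                  # cycle 0: request (write)
ram.clock(0, mem_rw_o=1, mem_adr_o=0x10, mem_dat_o=0xBEEF)  # cycle 1: store
ram.clock(1)                                  # cycle 2: request (read)
read_back = ram.clock(0, mem_rw_o=0, mem_adr_o=0x10)        # cycle 3: load
held = ram.clock(0, mem_rw_o=0, mem_adr_o=0x20)             # no request: hold
```

Note that the request is evaluated with the old flip-flop value before the new request is stored, which yields the one-cycle offset between MEM\_REQ\_O and the actual access.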

#### **6.4.2.** Wishbone Interface

For many applications, directly connecting the CPU to the data/instruction memory or memories might be sufficient (CPU-only implementation). However, many applications require more accessible memory and also some kind of integrated bus to communicate with other SoC modules (like timers, interfaces, ...). The Atlas processor implementation features a Wishbone-compatible bus interface to access other system components via an on-chip network fabric (a copy of the Wishbone specification can be found in the *core/doc* folder).

To allow an efficient use of the bus system, a shared instruction and data cache is connected to the Wishbone bus interface. Furthermore, a memory management unit (MMU) is inserted to extend the accessible memory area to up to 4GB. Of course, the MMU can be bypassed (and therefore removed from the design) when 64kB of addressable memory/IO space within the Wishbone network is sufficient.

The basic structure of the Atlas processor is shown in the figure below. By default, no user coprocessor is implemented within the Atlas processor.



Figure 25: Atlas Processor block diagram

### 6.4.3. Coprocessor Interface

The coprocessor interface is dedicated to connecting up to two external coprocessors (abbreviated as CP) directly to the Atlas CPU without the need to couple them via some kind of system bus. This feature makes it possible to create a small microprocessor system with two tightly coupled processing devices. The data communication between the CPU and a CP is based on direct register transfers between the two entities. Furthermore, direct data manipulation operations specifying two registers of the CP and a command are also implemented. For more information about the transfer and processing instructions, refer to the coprocessor instruction references.

The signal names of the Atlas CPU coprocessor interface port and their functions are shown in the table below. All CPU output signals of the coprocessor interface are connected to both coprocessors, except for the two enable signals.

| Signal name | Size (bit) | Direction | Function                                                                                                                                                                       |
|-------------|------------|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| USR_CP_EN_O | 1          | out       | Coprocessor 0 (user coprocessor) chip select (active high)                                                                                                                     |
| SYS_CP_EN_O | 1          | out       | Coprocessor 1 (system coprocessor) chip select (active high)                                                                                                                   |
| CP_OP_O     | 1          | out       | Data transfer ('1') or CP data processing ('0') operation                                                                                                                      |
| CP_RW_O     | 1          | out       | Write data to CP ('1') or read data from CP ('0')                                                                                                                              |
| CP_CMD_O    | 9          | out       | Command interface; bit 2 downto 0: direct output of the CMD bit field of the CP instruction; bit 5 downto 3: CP operand B address; bit 8 downto 6: CP operand A / destination address |
| CP_DAT_O    | 16         | out       | Write data to both coprocessors                                                                                                                                                |
| CP_DAT_I    | 16         | in        | OR-ed read data of both coprocessors                                                                                                                                           |

Table 16: Coprocessor interface port of the Atlas CPU
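The bit layout of CP\_CMD\_O given in the table can be expressed as a pair of Python helpers. The function names are hypothetical; only the field positions (bits 2..0 command, bits 5..3 operand B, bits 8..6 operand A / destination) are taken from the table.

```python
def pack_cp_cmd(op_a: int, op_b: int, cmd: int) -> int:
    """Build the 9-bit CP_CMD_O word from its three 3-bit fields."""
    for value in (op_a, op_b, cmd):
        if not 0 <= value <= 7:
            raise ValueError("each field must fit into 3 bits")
    return (op_a << 6) | (op_b << 3) | cmd


def unpack_cp_cmd(word: int):
    """Split a 9-bit CP_CMD_O word into (op_a, op_b, cmd)."""
    return (word >> 6) & 0x7, (word >> 3) & 0x7, word & 0x7


# Operand A / destination = c2, operand B = c0, command = 5
cmd_word = pack_cp_cmd(op_a=2, op_b=0, cmd=5)
fields = unpack_cp_cmd(cmd_word)
```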

The basic layout of the coprocessor-CPU connection is illustrated in the figure below. Note that, because of the OR-ed data read-back, each coprocessor has to output zero (x"0000") on its data read output port whenever it is not enabled by its SYS\_CP\_EN\_O or USR\_CP\_EN\_O signal, respectively.



Figure 26: Coprocessor interface; signal names correspond to the CPU's CP interface port
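The zero-output requirement follows directly from the OR-ed read-back path, as the following Python sketch shows. The function names are hypothetical, and the model only covers the combinational read path.

```python
def cp_read_output(enabled: bool, value: int) -> int:
    """Read output of a well-behaved coprocessor: zero unless selected."""
    return (value & 0xFFFF) if enabled else 0x0000


def cp_dat_i(usr_en: bool, usr_value: int, sys_en: bool, sys_value: int) -> int:
    """CP_DAT_I as seen by the CPU: OR of both coprocessor read outputs."""
    return cp_read_output(usr_en, usr_value) | cp_read_output(sys_en, sys_value)


# Only the system coprocessor is selected, so its value passes unchanged.
clean = cp_dat_i(usr_en=False, usr_value=0x00F0, sys_en=True, sys_value=0x0F00)

# A coprocessor that ignores its enable signal corrupts the read data.
corrupted = 0x00F0 | cp_read_output(True, 0x0F00)
```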

# 6.5. System Coprocessor (MMU)

By default, the Atlas processor features a memory management unit (MMU). This MMU, implemented as the system coprocessor (coprocessor #1), enables the user to access a memory/IO space of up to 2<sup>32</sup> bytes or 2<sup>31</sup> words (4GB), respectively. To this end, the actual data and instruction addresses from the CPU, which are 16 bits wide, are concatenated with another 16 bits that determine the accessible data and instruction page, creating a 32-bit wide address.

### **Theory of Operation**

The complete accessible data space of 2<sup>32</sup> bytes is separated into 2<sup>16</sup> "pages" of 2<sup>16</sup> bytes each. The actual page is selected via the most significant 16 bits of the final address. These page address bits are taken from the page registers; a separate register exists for instruction and for data page access in each of the two operating modes. Together with the data and instruction address bits of the CPU, which form the least significant 16 bits, the final address is constructed. Since the MMU is aware of the current CPU operating mode, it switches automatically between the user and system mode page registers.



Figure 27: MMU address generation block diagram; the numbers in the arrows refer to the address widths
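The address generation described above amounts to a simple concatenation, sketched here in Python. The function names are hypothetical, and the page value passed in stands for whichever I- or D-page register is active.

```python
def select_page(system_mode: bool, sys_page: int, usr_page: int) -> int:
    """Pick the page register that matches the current operating mode."""
    return sys_page if system_mode else usr_page


def mmu_address(page: int, cpu_addr: int) -> int:
    """Concatenate a 16-bit page with a 16-bit CPU address (32-bit result)."""
    return ((page & 0xFFFF) << 16) | (cpu_addr & 0xFFFF)


# User-mode data access to CPU address 0x2000 with the user D-page set to 1
page = select_page(system_mode=False, sys_page=0x0000, usr_page=0x0001)
final_address = mmu_address(page, 0x2000)
```

With all page registers set to zero, the final address equals the CPU address, which corresponds to the 64kB space that is available without the MMU.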

Whenever an interrupt or exception occurs, the system I- and D-page registers are automatically set to zero (x"0000"), an immediate zero-page output of the MMU is generated, and the last accessed I- and D-pages (the last values of the corresponding system page registers) are stored to the I- and D-link registers. This makes it easy to restore the last accessed pages after an interrupt has been processed: the data of the D/I-link registers just has to be copied back to the system page registers when the interrupt handler has finished.
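The save-and-restore behavior of the page registers can be summarized in a small Python model. The class and method names are hypothetical, and the sketch follows the reading that the link registers capture the previous contents of the system page registers.

```python
class MmuPageRegs:
    """Sketch of the MMU page and link registers (16 bits each)."""

    def __init__(self):
        self.sys_i_page = 0
        self.sys_d_page = 0
        self.i_page_link = 0
        self.d_page_link = 0

    def on_interrupt(self):
        """Interrupt/exception entry: save system pages, then zero them."""
        self.i_page_link = self.sys_i_page
        self.d_page_link = self.sys_d_page
        self.sys_i_page = 0x0000
        self.sys_d_page = 0x0000

    def link_copy(self):
        """Handler exit: copy the link registers back to the system pages."""
        self.sys_i_page = self.i_page_link
        self.sys_d_page = self.d_page_link


regs = MmuPageRegs()
regs.sys_i_page, regs.sys_d_page = 0x0012, 0x0034
regs.on_interrupt()
during_handler = (regs.sys_i_page, regs.sys_d_page)
regs.link_copy()
after_handler = (regs.sys_i_page, regs.sys_d_page)
```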

#### Interface

The MMU is accessed via the coprocessor interface and the corresponding data transfer and data processing instructions. Remember that the system coprocessor can only be accessed in system mode. A list of all accessible registers is shown below.

| Register | Name            | Function                                                       |
|----------|-----------------|----------------------------------------------------------------|
| c0       | MMU_CTRL        | MMU control register, see table below                          |
| c1       | MMU_SCRATCH     | Scratch register; free to use                                  |
| c2       | MMU_SYS_I_PAGE  | I-Page for system mode                                         |
| c3       | MMU_SYS_D_PAGE  | D-Page for system mode                                         |
| c4       | MMU_USR_I_PAGE  | I-Page for user mode                                           |
| c5       | MMU_USR_D_PAGE  | D-Page for user mode                                           |
| c6       | MMU_I_PAGE_LINK | Last accessed I-Page after an interrupt/exception has occurred |
| c7       | MMU_D_PAGE_LINK | Last accessed D-Page after an interrupt/exception has occurred |

Table 17: MMU register map

The functionality of the different control register bits is explained in the table below.

| Bit | Name    | R/W | Function                                                                   |
|-----|---------|-----|----------------------------------------------------------------------------|
| 0   | cflush  | W   | Flush cache to sync memory with cache when set to '1'                      |
| 1   | cclr    | W   | Invalidate all cache entries (reload cache) when set to '1'                |
| 2   | dda     | R/W | Enable direct data access (bypass cache for data requests) when set to '1' |
| 3   | csync   | R   | Cache is in sync with memory when '1'                                      |
| 4   | bus_err | R/W | Bus error has occurred when '1', write a '1' to acknowledge                |
| 5-15 | -      | R/W | Reserved, should not be altered                                            |

Table 18: MMU control register bits
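For host-side tools or test benches, the control register layout can be decoded with a few bit masks. The following Python sketch uses only the bit positions from the table above; the constant and function names are hypothetical.

```python
# Bit positions of the MMU control register (from the table above)
BIT_FLUSH = 0     # flush cache to memory
BIT_CLEAR = 1     # invalidate all cache entries
BIT_DDA = 2       # direct data access enable
BIT_CSYNC = 3     # cache is in sync with memory
BIT_BUS_ERR = 4   # bus error flag


def bit_is_set(ctrl: int, bit: int) -> bool:
    """Return True if the given bit of the 16-bit control word is '1'."""
    return bool((ctrl >> bit) & 1)


def describe_ctrl(ctrl: int) -> dict:
    """Decode the status-related bits of a control register value."""
    return {
        "dda": bit_is_set(ctrl, BIT_DDA),
        "csync": bit_is_set(ctrl, BIT_CSYNC),
        "bus_err": bit_is_set(ctrl, BIT_BUS_ERR),
    }


status = describe_ctrl(0b0000_0000_0000_1100)  # dda and csync set
```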

Since reading, altering and writing back of the control register takes at least three cycles to complete, most of the relevant control functions can also be controlled by using the coprocessor data processing instruction including a specific command code to trigger the configuration bits of the control register. All implemented control commands are listed in the table below. Using unimplemented commands will not have an effect on the MMU.

| Command | Name        | ASM Usage Example  | Function                                                   |
|---------|-------------|--------------------|------------------------------------------------------------|
| #0      | flush_cache | CDP #1, c0, c0, #0 | Flush cache to sync memory with cache                      |
| #1      | clear_cache | CDP #1, c0, c0, #1 | Invalidate all cache entries (reload cache)                |
| #2      | en_dir_acc  | CDP #1, c0, c0, #2 | Enable direct data access (bypass cache for data requests) |
| #3      | dis_dir_acc | CDP #1, c0, c0, #3 | Disable direct data access                                 |
| #4      | ack_bus_err | CDP #1, c0, c0, #4 | Acknowledge bus error interrupt                            |
| #5      | link_copy   | CDP #1, c0, c0, #5 | Copy link registers back to system page registers          |

Table 19: Currently implemented MMU commands

### **MMU-Interrupt**

The MMU also features an interrupt request output (connected to IRQ1 of the CPU) to indicate a bus error. A bus error occurs whenever the bus interface does not receive an acknowledge from an accessed address in the Wishbone network within a specific time (max\_bus\_latency\_c constant in the Atlas package file).

### **Update-Latency**

It takes two cycles until a new value written to a page register takes effect on the address output of the MMU. This ensures that a branch executed directly after the MMU register update instruction results in a branch to the correct position. Note: There must be no delay between the MMU I-page write instruction and the actual branch instruction!

### **ASM Usage Examples**

Some examples of how to use the MMU are presented below.

```
;Absolute branch to label "destination" (within 32-bit address space)
;executed in system mode

LDIL R1, #XLOW[destination] ; load high address (upper 16 bit) of
LDIH R1, #XHIGH[destination] ; label address
LDIL R0, #LOW[destination] ; load lower address (lower 16 bit) of
LDIH R0, #HIGH[destination] ; label address

;there must be no delay between the next two instructions!
MCR #1, C2, R1, #0 ; load high address to MMU's sys-i-page register
GTX R0 ; load low address to PC → finalize branch
```

```
;Flush cache and wait until cache is synchronous to memory
;executed in system mode
;bit 3 of the MMU's control register indicates a synchronous cache

CDP #1, C0, C0, #0 ; directly execute MMU's "flush cache" command

get_status:
    MRC #1, R0, C0, #0 ; read MMU status register to r0
    STBI R0, #3 ; store inverted bit 3 of r0 to T-flag to test it
    BTS get_status ; go to beginning of loop when original sync-flag is '0'

cache_sync: ; cache is sync when arriving here
```

### 6.6. Main Control Bus

The following table shows the bit locations and signal names of the main system control bus. All primary control signals, which emerge from the opcode decoder and are forwarded through the complete pipeline, are combined within this bus. Even though not all signals are used in every pipeline stage, all signals are carried to the end of the processing pipeline. This helps to keep the architecture flexible for future changes.

| Bit #                      | Signal name      | Function                                                                  |
|----------------------------|------------------|---------------------------------------------------------------------------|
| **Global Control**         |                  |                                                                           |
| 0                          | ctrl_en_c        | A '1' indicates a valid operation within the corresponding pipeline stage |
| 1                          | ctrl_mcyc_c      | Multi-cycle/atomic operation in progress, no interrupt possible           |
| **Operand A**              |                  |                                                                           |
| 2                          | ctrl_ra_is_pc_c  | Operand A is the program counter                                          |
| 3                          | ctrl_clr_ha_c    | Set higher byte of operand A to 0                                         |
| 4                          | ctrl_clr_la_c    | Set lower byte of operand A to 0                                          |
| 5                          | ctrl_ra_0_c      | Operand register A address bit 0                                          |
| 6                          | ctrl_ra_1_c      | Operand register A address bit 1                                          |
| 7                          | ctrl_ra_2_c      | Operand register A address bit 2                                          |
| 8                          | ctrl_ra_3_c      | Operand register A address bit 3, indicating source mode                  |
| **Operand B**              |                  |                                                                           |
| 9                          | ctrl_rb_is_imm_c | Operand B is an immediate                                                 |
| 10                         | ctrl_rb_0_c      | Operand register B address bit 0                                          |
| 11                         | ctrl_rb_1_c      | Operand register B address bit 1                                          |
| 12                         | ctrl_rb_2_c      | Operand register B address bit 2                                          |
| 13                         | ctrl_rb_3_c      | Operand register B address bit 3, indicating source mode                  |
| **Destination Register**   |                  |                                                                           |
| 14                         | ctrl_rd_wb_c     | Enable write-back to register file                                        |
| 15                         | ctrl_rd_0_c      | Destination register address bit 0                                        |
| 16                         | ctrl_rd_1_c      | Destination register address bit 1                                        |
| 17                         | ctrl_rd_2_c      | Destination register address bit 2                                        |
| 18                         | ctrl_rd_3_c      | Destination register address bit 3, indicating destination mode           |
| **ALU Control**            |                  |                                                                           |
| 19                         | ctrl_alu_fs_0_c  | ALU function select bit 0                                                 |
| 20                         | ctrl_alu_fs_1_c  | ALU function select bit 1                                                 |
| 21                         | ctrl_alu_fs_2_c  | ALU function select bit 2                                                 |
| 22                         | ctrl_alu_usec_c  | Use mode-corresponding carry flag for computation                         |
| 23                         | ctrl_alu_usez_c  | Use mode-corresponding zero flag for computation                          |
| 24                         | ctrl_fupdate_c   | Update ALU flags after processing                                         |
| **Bit Manipulation**       |                  |                                                                           |
| 25                         | ctrl_tf_store_c  | Store bit to mode-corresponding transfer flag                             |
| 26                         | ctrl_tf_inv_c    | Invert bit to be stored to T-flag                                         |
| 27                         | ctrl_bit_0_c     | Bit index bit 0                                                           |
| 28                         | ctrl_bit_1_c     | Bit index bit 1                                                           |
| 29                         | ctrl_bit_2_c     | Bit index bit 2                                                           |
| 30                         | ctrl_bit_3_c     | Bit index bit 3                                                           |
| **System Register Access** |                  |                                                                           |
| 31                         | ctrl_msr_wr_c    | Write access to MSR                                                       |
| 32                         | ctrl_msr_rd_c    | Read data from MSR                                                        |
| 33                         | ctrl_pc_wr_c     | Write access to PC                                                        |
| **Branch/Context Control** |                  |                                                                           |
| 34                         | ctrl_cond_0_c    | Condition code bit 0                                                      |
| 35                         | ctrl_cond_1_c    | Condition code bit 1                                                      |
| 36                         | ctrl_cond_2_c    | Condition code bit 2                                                      |
| 37                         | ctrl_cond_3_c    | Condition code bit 3                                                      |
| 38                         | ctrl_branch_c    | Current operation is a branch operation                                   |
| 39                         | ctrl_link_c      | Perform link operation (store return address to LR)                       |
| 40                         | ctrl_syscall_c   | Current operation is some kind of software interrupt                      |
| 41                         | ctrl_ctx_down_c  | Switch down to user mode                                                  |
| **Data Memory Access**     |                  |                                                                           |
| 42                         | ctrl_mem_acc_c   | Request access to data memory                                             |
| 43                         | ctrl_mem_wr_c    | Write ('1') or read ('0') access                                          |
| 44                         | ctrl_mem_bpba_c  | Use bypassed base address                                                 |
| 45                         | ctrl_mem_daa_c   | Use delayed base address                                                  |
| **Coprocessor Access**     |                  |                                                                           |
| 46                         | ctrl_cp_acc_c    | Current operation is a coprocessor operation                              |
| 47                         | ctrl_cp_trans_c  | Coprocessor data transfer ('1') or internal processing operation ('0')    |
| 48                         | ctrl_cp_wr_c     | Write access to coprocessor                                               |
| 49                         | ctrl_cp_id_c     | Coprocessor ID bit                                                        |
| **MAC Unit**               |                  |                                                                           |
| 50                         | ctrl_use_mac_c   | Access the multiply-and-accumulate unit (if implemented)                  |
| 51                         | ctrl_load_mac_c  | Load an accumulation value to the MAC buffer                              |
| 52                         | ctrl_use_offs_c  | Use the loaded value to perform the actual MAC operation                  |

Table 20: Processor main control bus
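When inspecting simulation dumps of the control bus, it can be convenient to extract multi-bit fields by their position. The following Python sketch assumes that the bus is represented as an integer with bit 0 as the least significant bit. The helper names and the `CTRL_*` constants are hypothetical; the bit positions are those from the table above.

```python
CTRL_WIDTH = 53      # bits 0..52
CTRL_RA_LSB = 5      # ctrl_ra_0_c, 4-bit operand A register address
CTRL_RB_LSB = 10     # ctrl_rb_0_c, 4-bit operand B register address
CTRL_RD_WB = 14      # ctrl_rd_wb_c
CTRL_RD_LSB = 15     # ctrl_rd_0_c, 4-bit destination register address
CTRL_COND_LSB = 34   # ctrl_cond_0_c, 4-bit condition code


def get_bit(ctrl: int, bit: int) -> int:
    """Return a single control bus bit."""
    return (ctrl >> bit) & 1


def get_field(ctrl: int, lsb: int, width: int) -> int:
    """Return a multi-bit field starting at bit position lsb."""
    return (ctrl >> lsb) & ((1 << width) - 1)


def set_field(ctrl: int, lsb: int, width: int, value: int) -> int:
    """Return ctrl with the given field replaced by value."""
    mask = ((1 << width) - 1) << lsb
    return (ctrl & ~mask) | ((value << lsb) & mask)


# Build a control word: operand A = register 1 of bank 1, condition code 6,
# register write-back enabled.
ctrl = 0
ctrl = set_field(ctrl, CTRL_RA_LSB, 4, 0b1001)
ctrl = set_field(ctrl, CTRL_COND_LSB, 4, 0b0110)
ctrl |= 1 << CTRL_RD_WB
```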

As mentioned before, not all signals are used in all pipeline stages. Therefore, some signals are reused under a different alias name once their original purpose is no longer relevant for further processing. The table below presents these new signals and the original signals they reuse.

| Signal name       | Reused signal  | Function                                           |
|-------------------|----------------|----------------------------------------------------|
| ctrl_wb_en_c      | ctrl_rd_wb_c   | Valid write back                                   |
| ctrl_rd_mem_acc_c | ctrl_mem_acc_c | True memory access                                 |
| ctrl_rd_cp_acc_c  | ctrl_cp_acc_c  | True coprocessor read access                       |
| ctrl_cp_msr_rd_c  | ctrl_msr_rd_c  | True coprocessor or MSR read access                |
| ctrl_cp_cmd_0_c   | ctrl_rb_0_c    | Coprocessor command bit 0                          |
| ctrl_cp_cmd_1_c   | ctrl_rb_1_c    | Coprocessor command bit 1                          |
| ctrl_cp_cmd_2_c   | ctrl_rb_2_c    | Coprocessor command bit 2                          |
| ctrl_cp_ra_0_c    | ctrl_ra_0_c    | Coprocessor operand A bit 0                        |
| ctrl_cp_ra_1_c    | ctrl_ra_1_c    | Coprocessor operand A bit 1                        |
| ctrl_cp_ra_2_c    | ctrl_ra_2_c    | Coprocessor operand A bit 2                        |
| ctrl_cp_rd_0_c    | ctrl_rd_0_c    | Coprocessor operand A / destination register bit 0 |
| ctrl_cp_rd_1_c    | ctrl_rd_1_c    | Coprocessor operand A / destination register bit 1 |
| ctrl_cp_rd_2_c    | ctrl_rd_2_c    | Coprocessor operand A / destination register bit 2 |
| ctrl_re_xint_c    | ctrl_rb_1_c    | Re-enable global external interrupt flag           |
| ctrl_msr_am_0_c   | ctrl_ra_1_c    | MSR access mode option bit 0                       |
| ctrl_msr_am_1_c   | ctrl_ra_2_c    | MSR access mode option bit 1                       |

Table 21: Processor main control bus, signal reuse during pipeline process